googlenews-vectors-negative300
Stock Market Data
#Charlottesville
40-year AVHRR record of visible channel Rrs and coccolithophorid blooms, links to netCDF files
datasets.zip
RAW files
X-ray crystallography data
Figure S2
Ferndale Bog loss-on-ignition dataset
NN training and testing
Air_Quality
Google-Landmarks Dataset
Biodiversity in National Parks
New York Times Best Sellers
VDiscovery
Geothermal Geodatabase for Routt Hot Springs, Routt County, Colorado
International airline passengers
CLAAS-2: CM SAF CLoud property dAtAset using SEVIRI - Edition 2
Datasets for outlier detection
cell differentiation tree
Appendix S2
Images of Lego Bricks
Single locus analysis
Excel Dataset
Table_S2
Clustering analysis of microarray data
Radiocarbon in CO2 and Soil Organic Matter from Laboratory Incubations, Barrow, Alaska, 2014
USDA plant database
GloVe 6B
Russian Troll Tweets
Genomic datasets
Figure3A
Chlorophyll a and Chlorophyll c
all-words
Salivary sTREM-1 and PGLYRP-1
PCA dataset
Intraday-Data
Solar System Features
Powerlifting Database
Amazon Fine Food Reviews
World Cities
Spatial distribution of a flying seabird (Antarctic petrel) and penguins (Adélie penguin, Emperor penguin) in the wider Weddell Sea (Antarctica) with links to ArcGIS map packages
Raw MSE data
Cycling Metrics
Human proteins interactions
Interactive Locomotion
OFC data
Data.xlsx
Adaptive Incremental Mixture Markov Chain Monte Carlo
Yelp 2015
Star Cluster Simulations
Breast Histopathology Images
Crypto Currencies
Additional File 4:
RuStance
Chicago Crime
Strong Rotational Anisotropies Affect Nonlinear Chiral Metamaterials
What we are trying to do
New draft item
Fig 3a.
sales of shampoo
Python-scripts
Rooftop Energy Potential of Low Income Communities in America REPLICA
Supplementary Table 9
Empathic accuracy
Bollywood Movie Dataset
The Movies Dataset
synthetic_dataset
ACTINN
THzSecurityImageDataset
BOLD5000
horses for courses
Flickr Image dataset
15/2 pollen surface sample dataset
The Social Life of Data
What people purchase
Metadata record for: Reference gene and small RNA data from multiple tissues of Davidia involucrata Baill
THz Security Image Dataset
Orthologous groups
Seattle Pet Licenses
100 phylogenies
High resolution global grids of revised Priestley-Taylor and Hargreaves-Samani coefficients for assessing ASCE-standardized reference crop evapotranspiration and solar radiation, links to ESRI-grid files, supplement to: Aschonitis, Vassilis G; Papamichail, Dimitris; Demertzi, Kleoniki; Colombani, Nicolo; Mastrocicco, Micol; Ghirardini, Andrea; Castaldelli, Giuseppe; Fano, Elisa-Anna (2017): High-resolution global grids of revised Priestley-Taylor and Hargreaves-Samani coefficients for assessing ASCE-standardized reference crop evapotranspiration and solar radiation. Earth System Science Data, 9(2), 615-638
Disk Space Data
Wikidata Property Ranking: Relevance judgments for properties of 350 Wikidata entities
Study 1 stimuli
How many samples are needed to prove the absence of contamination - an example using arsenic?
G-03
dataset 4
Understanding of Pain
NTU Dataset
JournalInformation
Queries: DBpedia
New York City population
Traffic accident severity
sex-classification
PatientDiagnosis
Supplemental Table 6
CO2 and CH4 Production and CH4 Oxidation in Low Temperature Soil Incubations from Flat- and High-Centered Polygons, Barrow, Alaska, 2012
Snow Cover Fraction (SCF) and snow depth obtained using terrestrial photography (2009-2013) in the control area Refugio Poqueira (Sierra Nevada, Spain), supplement to: Pimentel, Rafael; Herrero, Javier; Polo, María José (2017): Subgrid parameterization of snow distribution at a Mediterranean site using terrestrial photography. Hydrology and Earth System Sciences, 21(2), 805-820
New draft item
New draft item
MNIST Digit Recognition
Computational Imaging
Relevance assessment
Optimized implementations of voxel-wise degree centrality and local functional connectivity density mapping in AFNI
The Genomes Mapserver
Phylogenomic Supermatrices
15/1 pollen surface sample dataset
Incidence dataset
Student Feedback Dataset
Chat 80
smsspamcollection
The E2E Challenge Dataset
Speed Dating Experiment
Minimal dataset
Creating Customer Segments
Model output data
Linked Data Platforms
GPCP Version 2.2 Combined Precipitation Data Set
Captcha Images
Spam Text Message Classification
Data for both species
raw mapping data
IMDB Movie Review
Jester Collaborative Filtering Dataset
YouTube Comedy Slam
UCI Cardiotocography
person.csv
Smart Home Scenarios
Example dataset
Baby data
Micro-Loans
Hospital Charges for Inpatients
Supplementary Data 1
FocaLens
Medium Articles
Bank Marketing Dataset
Who starts and who debunks rumors: Webpages cited by rumor trackers
Style Color Images
thyroid CT images
No Data Sources
train.csv
Improved estimate of global gross primary production for reproducing its long-term variation, 1982-2017
Richness in ecosystem services
Data discovery and re-use
Query reformulation
chromosome number polymorphism
ISCO-08
A dataset of 30-meter annual vegetation phenology indicators (1985-2015) in urban areas of the conterminous United States
The Million Song Dataset
Dataset Descriptions
Minutiae Sample
Arabic Handwritten Characters Dataset
AnimalDataset
Blood donation in Brazil
An example of PSCP files in which multiple microarray datasets were analyzed simultaneously
Malicious and Benign Websites
wsdream
Additional File 10:
Meteorological dataset
GoodReads Dataset
Transect database
Fungi Dataset
Movie Industry
Question Answering Data
LSDO
Prospective Harmonization
concatenated mtDNA dataset
GloVe: Global Vectors for Word Representation
Data3
Spearman correlation coefficient matrix
Global Fuelbed Dataset
Black Friday
Failure dataset
Employee Attrition
EmBeD data
Behavioral Risk Factor Surveillance System
MNIST FASHION
Thermal soccer dataset
Supplementary Data 7
Medical Appointment
Mortality Curves
SMS dataset
Lahman Baseball Database
Piezoelectric Tensor Data
DueCredit: automated collection of citations for software, methods, and data
Spearman correlation coefficient analysis
Japanese-English Bilingual Corpus
Supplemental Table 5
Office Supply Sales
Dataset I
CelebA resized
Using CALIOP to estimate cloud-field base height and its uncertainty: the Cloud Base Altitude Spatial Extrapolator (CBASE) algorithm and dataset
LORIS: DICOM anonymizer
population genetic dataset
Our overall approach
Electronic supplementary material, Dataset S2
Developmental trajectories in brain development
IUGR dataset
WOCE-Argo Global Hydrographic Climatology (WAGHC Version 1.0)
2009 data
Visualization 2
Supraglacial Debris Cover
Two loci analysis
New draft item
Kaggle Datasets
Indirect Food Additives
Venues in New York City
HR 114 pollen surface sample dataset
A global gridded data set on tillage
Air Passengers
Movie Dataset
Fruits 360 dataset
New draft item
Geology datasets in North America, Greenland and surrounding areas for use with ice sheet models, supplement to: Gowan, Evan J; Niu, Lu; Knorr, Gregor; Lohmann, Gerrit (2019): Geology datasets in North America, Greenland and surrounding areas for use with ice sheet models. Earth System Science Data, 11(1), 375-391
R-code and datasets
Scottish Borders Roads
Indian Pines Hyperspectral Dataset
Suicides in India
Horse Racing Data
Museum of Modern Art Collection
Snow cover data across Nordenskiöldland, Svalbard, from point measurements during 2014-2016, supplement to: Möller, Marco; Möller, Rebecca (in review): Snow cover variability across glaciers in Nordenskiöldland (Svalbard) from point measurements in 2014–2016. Earth System Science Data Discussions, 1-16
Concatenated multigene alignments
BASE calculations, supplement to: Searchinger, Tim D; Wirsenius, Stefan; Beringer, Tim; Dumas, Patrice (2018): Assessing the efficiency of changes in land use for mitigating climate change. Nature, 564(7735), 249-253
The transcriptome assembly
Journal Data
Shifted cumulative regulation
dataset from TCGA
Additional File 2:
Names Corpus
Metadata record for: CommonMind Consortium provides transcriptomic and epigenomic data for Schizophrenia and Bipolar Disorder
Searches and filters
Soil hydrological data
GWAS summary statistics
Supplementary table 2
Random Forest Code
USArrests
Yelp 2013
PSets
Dielectric Constant Data
Steady state simulation and operation planning of integrated energy supply systems
RumourEval 2019 data
Non-synchronized HeLa cells
Twitter Threads
Crime in Atlanta
ENEM 2015
Lung Nodule Malignancy
nba draft
Company Reviews
Dataset Summary
Bug Triaging
Prediction of Pathological Stage in Patients with Prostate Cancer: A Neuro-Fuzzy Model
New York Stock Exchange
Cat and Dog
Nobel Prize winners
China Merged Surface Temperature, supplement to: Yun, Xiang; Huang, Boyin; Cheng, Jiayi; Xu, Wenhui; Qiao, Shaobo; Li, Qingxiang (in review): A new merge of global surface temperature datasets since the start of the 20th Century. Earth System Science Data Discussions, 1-44
NOAA-CIRES-DOE Twentieth Century Reanalysis Version 3
Figurnoye Lake pollen dataset
World Values Survey
Latitudinal gradient in seed dispersal distance
MIT-BIH Arrhythmia Database
Crimes in Boston
Yelp 2014
RASH evaluation
New draft item
Aretha Franklin
Reconciliation of quantum local master equations with thermodynamics
A cortical surface-based geodesic distance package for Python
DataA1
NCEP ADP ETA / NAM Upper Air Observation Subsets
The International Surface Pressure Databank version 3
loghub
Physical Characteristics of Comets
Geostrophic Currents in the northern Nordic Seas - A Combined Dataset of Multi-Mission Satellite Altimetry and Ocean Modeling (data), supplement to: Müller, Felix L; Dettmering, Denise; Wekerle, Claudia; Schwatke, Christian; Bosch, Wolfgang; Seitz, Florian (in review): Geostrophic Currents in the northern Nordic Seas - A Combined Dataset of Multi-Mission Satellite Altimetry and Ocean Modeling. Earth System Science Data Discussions
Transplant experiment
Accounting Network
female survival
Molluscan Shell Matrix Proteins
Yelp Dataset: A trove of reviews, businesses, users, tips, and check-in data!
Supplementary Figure 2
Toothaker Pond pollen surface sample dataset
Dataset for "The effect of acute hypohydration on glycemia in healthy adults"
The simulated sEMG signals
Water Analyte Concentrations
Global Shark Attack File
Carsales
Solar and Lunar Eclipses
Imagenet32
Study 3 data
Metadata record for: The sequence and de novo assembly of Takifugu bimaculatus genome using PacBio and Hi-C technologies
Fisherman Lake pollen surface sample dataset
20 Newsgroups
R Codes
Post-Operative Patient Data Set
Sets of omnidirectional images
ABC dataset
Figure S4
Image Dataset for Object Recognition
Supplemental Dataset 1
R data file
Aufeis (naleds) of the North-East of Russia: GIS catalogue for the Indigirka River basin, supplement to: Makarieva, Olga; Shikhov, Andrey; Nesterova, Nataliia; Ostashov, Andrey (2019): Historical and recent aufeis in the Indigirka River basin (Russia). Earth System Science Data, 11(1), 409-420
Appendix S1
Lower Back Pain Symptoms Dataset
Bike Sharing
Democrat Vs. Republican Tweets: 200 tweets of Dems and Reps
GTZAN music/speech collection
Boston housing dataset
Pima Indians Diabetes Database: Predict the onset of diabetes based on diagnostic measures
Caricature Image
Demand Dataset
rawdata
M3500
Supplementary Table 1-2
Multi-source global wetland maps combining surface water imagery and groundwater constraints
IMDB data
GRUN: Global Runoff Reconstruction
Handwritten Names
Bitcoin Dataset
Wikipedia Article Titles
GRACE-REC: A reconstruction of climate-driven water storage changes over the last century
Motifs Data
SPEECH-COCO
Chiangsaen
Metadata record for: De novo transcriptome assembly and analysis of the freshwater araphid diatom Fragilaria radians, Lake Baikal
ArXiV Archive
Adult income dataset
BBC News Summary
Ships in Satellite Imagery
A global database of radiogenic Nd and Sr isotopes in marine and terrestrial samples (V. 2.0)
A novel nanozyme assay utilising the catalytic activity of silver nanoparticles and SERRS
ACL Accepted Papers
Microsatellites dataset
Multivariate alternating decision tree
80 Cereals
Student Marks
Input and Output Files
Community phylogeny
Analysis of complete dataset
Data.xlsx
Air quality Dataset
Histogram Inputs
Uncompressed version
NFL Statistics
FMC dataset
Reconciliation Vocabulary
Olivetti Faces
10Knots
Retail Sales Forecasting
Study 1 data
Melbourne Housing
Supplemental Dataset 8
Cryptocurrency Historical Prices
datasets.tar.gz
Dataset2
Finding and Measuring Lungs in CT Data
train6
Batch Effects Correction with Unknown Subtypes
Time series of streamflow occurrence from 182 sites in ephemeral, intermittent and perennial streams in the Attert catchment, Luxembourg
Readme.txt
Iris dataset
A schematic view of the procedure
bank-marketing
Additional File 1: Table S1.
ENEM 2016
movie lens
Color Rendition Characteristics
Multidimensional Poverty Measures
testcar
Wisconsin Breast Cancer Dataset
Pima Indian Diabetes Data
InteractiveSegmentation
Facial keypoints
African Elephants
GPCP Version 2.3 Monthly Analysis Product
BRFSS 2001-2010
Data Sharing, Distribution and Updating Using Social Coding Community Github and LaTeX Packages in Graduate Research
Metadata record for: Longitudinal dataset of human-building interactions in U.S. offices
Observations of sea turtles
Supplemental Dataset
Bibliographic Dataset
Data4
Reference in the dataset
Boston Housing
Visualization 3
Lecture capture survey
Social Network Ads
ChinaCropPhen1km: A high-resolution crop phenological dataset for three staple crops in China during 2000-2015 based on LAI products
Mitochondrial DNA sequences
Allele files
Analyzing data from the digital healthcare exchange platform for surveillance of antibiotic prescriptions in primary care in urban Kenya: a mixed-methods study
Avocado Prices
Metadata record for: A dataset of cetacean occurrences in the Eastern North Atlantic
LiveStreaming
Frames from video
The phenotype gap
Introduction to Machine Learning
Dataset S1
Hay Lake pollen dataset
Deriving Canada-wide soils dataset for use in Soil and Water Assessment Tool (SWAT), supplement to: Cordeiro, Marcos R C; Lelyk, Glenn; Kröbel, Roland; Legesse, Getahun; Faramarzi, Monireh; Masud, Mohammad Badrul; McAllister, Tim (2018): Deriving a dataset for agriculturally relevant soils from the Soil Landscapes of Canada (SLC) database for use in Soil and Water Assessment Tool (SWAT) simulations. Earth System Science Data, 10(3), 1673-1686
European Soccer Database
Overwatch
PPG-BP Database
Submerged sand deposits data from Western Sardinia, Mediterranean Sea organised in an interoperable Spatial Data Infrastructure, supplement to: Brambilla, Walter; Conforti, Alessandro; Simeone, Simone; Carrara, Paola; Lanucara, Simone; De Falco, Giovanni (2019): Data set of submerged sand deposits organised in an interoperable spatial data infrastructure (Western Sardinia, Mediterranean Sea). Earth System Science Data, 11(2), 515-527
JS Database
Wikipedia Edits
Bitcoin Historical Data
London Crime
LinkedIn Profile Data
Global Causes of Mortality
Elastic Tensor Data
Metadata record for: Temporary dense seismic network during the 2016 Central Italy seismic emergency for microzonation studies
New draft item
International Financial Statistics
Fashion Mnist
Supplementary table 5
The Global Energy Balance Archive (GEBA) version 2017: A database for worldwide measured surface energy fluxes. Link to database files, supplement to: Wild, Martin; Ohmura, Atsumu; Schär, Christoph; Müller, Guido; Folini, Doris; Schwarz, Matthias; Hakuba, Maria Z; Sanchez-Lorenzo, Arturo (2017): The Global Energy Balance Archive (GEBA) version 2017: a database for worldwide measured surface energy fluxes. Earth System Science Data, 9(2), 601-613
Electronic supplementary material, Dataset S3
iris.csv
Quotes Dataset
Online Job Postings
Missing People
Nigeria dishes
Top 100 2017
Users Data
Datasets used in this study
Snow cover maps (C1) of Guadalfeo Monitoring Network (Sierra Nevada, Spain), supplement to: Polo, María José; Herrero, Javier; Pimentel, Rafael; Pérez-Palazón, María José (2019): The Guadalfeo Monitoring Network (Sierra Nevada, Spain): 14 years of measurements to understand the complexity of snow dynamics in semiarid regions. Earth System Science Data, 11(1), 393-407
Predict Angina
Floating Island Lake pollen dataset
FROM-GLC-Hierarchy
Raw Data.xlsx
Video Games Review
Intel Xeon Scalable Processors
Fish Relatedness
Seattle Office for Civil Rights
Megapool
RAW_DATA
Loans data
Growth characteristics of Dahurian larch (Larix gmelinii) in northeast China during 1965-2015, supplement to: Jia, Bingrui; Zhou, Guangsheng (2018): Growth characteristics of natural and planted Dahurian larch in northeast China. Earth System Science Data, 10(2), 893-898
Data for: "A synthetic map of the northwest European Shelf sedimentary environment for applications in marine science"
Properties of PPI networks
Keras Models
Generating music with resting-state fMRI data
Pavia University Hyperspectral dataset
Crowding and Subitizing
ERA5 Reanalysis
ESM file
Game of Thrones
Iris Data Set
SNP datasets
Fecal hormones
sift data
Modern dataset
Metadata record for: De novo transcriptomes of 14 gammarid individuals for proteogenomic analysis of seven taxonomic groups
Glacier inventory of Pamir and Karakoram, link to GIS files, supplement to: Mölg, Nico; Bolch, Tobias; Rastner, Philipp; Strozzi, Tazio; Paul, Frank (2018): A consistent glacier inventory for Karakoram and Pamir derived from Landsat data: distribution of debris cover and mapping challenges. Earth System Science Data, 10(4), 1807-1827
Visualization 4
Hotel review
Daily temperature data from the Foothills Climate Array Mesonet, Canadian Rocky Mountains, 2005-2010, supplement to: Wood, Wendy H; Marshall, Shawn J; Fargey, Shannon E; Whitehead, Terri L (2018): Daily temperature records from a mesonet in the foothills of the Canadian Rocky Mountains, 2005-2010. Earth System Science Data, 10(1), 595-607
JSON File
Membrane feeding assays
morphological_data
heterozygosity-fitness
Pedestrian Dataset
Data for 27 countries
Geothermal Geodatabase for Rico Hot Springs Area and Lemon Hot Springs, Dolores and San Miguel Counties, Colorado
Bedroom air temperatures
A national dataset of annual urban extent (1985-2015) in the conterminous United States using Landsat time series data
Metadata record for: The sequencing and de novo assembly of the Larimichthys crocea genome using PacBio and Hi-C technologies
Football Events
Internal Cases
Bag of Words Meets Bags of Popcorn
Dataset and code (Matlab) for recoloring images
Arabic Natural Audio Dataset
Small mammal dataset
Data Table B
20 Newsgroups
Supplementary Table 7
Visualization 1
IRLA-CL
Simulation dataset
Indian License Plates
The exon diversity
RAxML Concatenated
Predicting a Biological Response
Twitter sentiment analysis
Lake catchment
A gridded dataset of belowground autotrophic respiration from 1980 to 2012 in global terrestrial ecosystems upscaling of observations
ESI-FTICR-MS Molecular Characterization of DOM Degradation under Warming in Tundra Soils from Barrow, Alaska
Microsatellite genotype data
web log dataset
Data file S1
Job Recommendation
Chinese Characters Generator
Toxic Words
Predicting Movie Revenue
PM in Kunming
Dataset II
Metadata record for: Time series of heat demand and heat pump efficiency for energy system modeling
input_data
US Mass Shootings
Video Game Sales
Breast Histology Images
Baseline surface radiation data (1992-2017), supplement to: Driemel, Amelie; Augustine, John; Behrens, Klaus; Colle, Sergio; Cox, Christopher J; Cuevas-Agulló, Emilio; Denn, Fred M; Duprat, Thierry; Dutton, Ellsworth G; Fukuda, Masato; Grobe, Hannes; Haeffelin, Martial; Hodges, Gary; Hyett, Nicole; Ijima, Osamu; Kallis, Ain; Knap, Wouter; Kustov, Vasilii; Lanconelli, Christian; Long, Charles; Longenecker, David; Lupi, Angelo; Maturilli, Marion; Mimouni, Mohamed; Ntsangwane, Lucky; Ogihara, Hiroyuki; Olano, Xabier; Olefs, Marc; Omori, Masao; Passamani, Lance; Pereira, Enio Bueno; Schmithüsen, Holger; Schumacher, Stefanie; Sieger, Rainer; Tamlyn, Jonathan; Vogt, Roland; Vuilleumier, Laurent; Xia, Xiangao; Ohmura, Atsumu; König-Langlo, Gert (2018): Baseline Surface Radiation Network (BSRN): structure and data description (1992-2017). Earth System Science Data, 10(3), 1491-1501
Hyperspectral images of tea
Spatial distribution of zoobenthos (sponges, echinoderms) in the wider Weddell Sea (Antarctica) with links to ArcGIS map packages
Additional-data
Iris flower dataset
W-band gyro-TWA
Interactive Hand Gesture
Charities in the United States
World Marathon Majors
Laptop Prices
Sokoto Coventry Fingerprint Dataset (SOCOFing)
A3130
Data.xlsx
Facilitating big data meta-analyses for clinical neuroimaging through ENIGMA wrapper scripts
Continuous meteorological monitoring at Cape Posillipo (Denza Institute weather station - Naples - Campania Region - Italy) during the period January 2014 - December 2018
Additional File 5
Scheduling In Cloud Computing
Mnist Data
Mango Transcriptome Assembly
Natural Speech Dataset
Dataset.xlsx
Advancing open science through NiData
Gender Recognition by Voice: Identify a voice as male or female
Music notes
IBI duration
DM Authors
TrainingInstitute
Hubway Data
World of Warcraft Avatar History
NCAR CESM Global Bias-Corrected CMIP5 Output to Support WRF/MPAS Research
Lending Club Loan Data
Abalone Dataset
Black Carbon measurements in Germany between 1994 and 2014, link to netCDF files, supplement to: Kutzner, Rebecca D; von Schneidemesser, Erika; Kuik, Friderike; Quedenau, Jörn; Weatherhead, Betsy; Schmale, Julia (2018): Long-term monitoring of black carbon across Germany. Atmospheric Environment, 185, 41-52
house price prediction
oregon education
APS dataset
The Blazing Signature Filter
Amino acid dataset
Supplementary Table 3
Identifiers on the Rise in Germany
Coder A
Lobbying Data
dataset cleaned
Data available for each species
TRIOML
US PRESIDENTS
Comic Books Images
Building Management System Analysis
Raw Data.xlsx
House Prices dataset
r/mexico
Gowalla Checkins
data_packet
PRIMAP-crf: UNFCCC CRF data in IPCC 2006 categories
Donald Trump Tweets
Board Game Data
Insult sets
painter test
Population Time Series Data
Movie Genre from its Poster
Data for figures
Gene matrix
Periodic table of the elements
Wikipedia Sentences
Passenger Satisfaction
Tatoeba Sentences
thyroid image data
Taxa partition
Terrestrial Water Budget Data Archive
Los Angeles Addresses
Dataset References
Consumer Complaints
MAS 5
MovieLens+IMDb
Cars Data
New draft item
Fantasy Premier League
CITES Wildlife Trade Database: A year in the international wildlife trade
Movie Reviews
Study 2 data
Fig 4a.
16/1 pollen surface sample dataset
car prediction
Drosophila Melanogaster Genome
Netflix Prize data: Dataset from Netflix's competition to improve their recommendation algorithm
Chest X-Ray Images (Pneumonia): 5,863 images, 2 categories
Arabic Handwritten Digits Dataset
Wikisource
Internet Archive
DBpedia
OpenStreetMap
Wikidata
National Register of Historic Places
UNESCO World Heritage Site
MusicBrainz
ChemIDplus
Project Gutenberg
MEROPS
AlloCiné
AllMusic
Internet Broadway Database
IUCN Red List
Integrated Authority File
Internet Movie Database
Catalogue of Life
GameRankings
The Oxford English Dictionary
International Union for Conservation of Nature
Virtual International Authority File
Find a Grave
Who Named It?
Fortune 500
Encyclopedia of Life
National Center for Biotechnology Information
Integrated Taxonomic Information System
Aozora Bunko
Rotten Tomatoes
Beilstein database
Bonn Convention
Gene Ontology
ZEMA
Geographic Names Information System
Transporter Classification database
Metacritic
OmegaWiki
Australian Plant Name Index
Biodiversity Heritage Library
PubMed
Last.fm
tz database
WikiMapia
Medical Subject Headings
Google Books
Research Papers in Economics
BirdLife International
The Zoological Record
Anime News Network
Box Office Mojo
PubMed Central
Europeana
British National Corpus
The European Library
Online Mendelian Inheritance in Man
Persée
VD 17
Gracenote
Collins English Dictionary
archINFORM
Pauline epistles
Artnet
Jeuxvideo.com
Eurogamer
freedb
Fortune 1000
Polity data series
AGROVOC
AMIS Plus
Aquatic Sciences and Fisheries Abstracts
Abandonia
eMedicine
Abbreviationes
Urban Dictionary
Bibliotheca Augustana
AcademiaNet
Aminet
Scopus
Corbis
ARKive
Karlsruher Virtueller Katalog
Windows Registry
American Battle Monuments Commission
World Digital Library
National Diet Library
AllMovie
Korean Movie Database
NASA/IPAC Extragalactic Database
Tatoeba
Discogs
Choral Public Domain Library
International Music Score Library Project
Coffin Texts
Linguist List
MedlinePlus
BRENDA
listed building in the United Kingdom
Bibliographic Ontology
Ordbog over det danske Sprog
MyHeritage
World Register of Marine Species
Art & Architecture Thesaurus
Deutsche Digitale Bibliothek
CIDOC Conceptual Reference Model
Arachne
Dublin Core
The Plant List
Perseus Project
EURODAC
Rodovid
SIMBAD
Deutsche Fotothek
The Merck Index
MetaCyc
arthistoricum.net
German Medical eLibrary
LibraryThing
Astrophysics Data System
ResearchGate
Registry of Toxic Effects of Chemical Substances
Netherlands Institute for Art History
Protein Data Bank
Australian National Heritage List
Austrian Literature Online
Instituto Nacional de Estadística y Geografía
Fossilworks
Joconde
Mathematics Genealogy Project
GeoNames
The Freesound Project
FishBase
Dictionary of Canadian Biography
YouPorn
WorldCat
Bridgeman Art Library
OKATO
Bildarchiv Foto Marburg
Bildindex
Registry of Open Access Repositories
Baseball-Reference.com
Biographical Portal
Contenta
BoardGameGeek
GenBank
ChEBI
Digital Literary Academy
UniProt
Grooveshark
Kyoto Encyclopedia of Genes and Genomes
International Plant Names Index
Atlas of the World's Languages in Danger
Project Runeberg
Pornhub
Encyclopaedia Metallum
PANGAEA
Linguee
International Children's Digital Library
Brown Corpus
International Shark Attack File
BugMeNot
Generally recognized as safe
Israeli Central Bureau of Statistics
Censimento nazionale delle edizioni italiane del XVI secolo
Structurae
Charity Navigator
Microsoft Academic Search
LibriVox
WikiTree
Cochrane Library
DrugBank
Society for American Baseball Research
Jamendo
World Atlas of Language Structures
Swedish Film Database
Crew United
Plena Ilustrita Vortaro de Esperanto
Current Index to Statistics
UbuWeb
Rate Your Music
Cyc
International HapMap Project
Hungarian Electronic Library
Online Etymology Dictionary
GEOnet Names Server
Pandora Radio
LyricWiki
Open Library
Web of Science
German Reference Corpus
Deutsches Textarchiv
dblp computer science bibliography
Digitale Bibliothek
Directory of Open Access Journals
DODIS
VirTheo
Pinakes
Shanghai Interbank Offered Rate
RedTube
e-rara.ch
Academic Search
GetInfo
Operabase
Schengen Information System
Encyclopedia of Triangle Centers
English Short Title Catalogue
Ensembl genome database project
MP3.com
ST 16
EudraVigilance
NNDB
Extrasolar Planets Encyclopaedia
FOAF
Max Planck Digital Library
Foundational Model of Anatomy
Filmdienst
JSTOR
Gesamtkatalog der Wiegendrucke
statistical business register
Flora of North America
Freebase
SIMAP
Web Gallery of Art
GEO-LEO
Gallica
George Eastman Museum
Orphanet
Unifrance
Getty Thesaurus of Geographic Names
Inter-Active Terminology for Europe
VD 16
Reaxys
Global Biodiversity Information Facility
Mammal Species of the World
MEDLINE
Mutopia Project
Vitis International Variety Catalogue
Norsk biografisk leksikon
PostGIS
Virtuelle Fachbibliothek Germanistik
Reptile Database
Immune Epitope Database and Analysis Resource
Index theologicus
VizieR
Virtual Manuscript Room
World Database on Protected Areas
VD 18
Stanford Physics Information Retrieval System
bibliographic database
Dicionário Houaiss da Língua Portuguesa
LIBRIS
Hessian Regional History Information System
Handbook of the Birds of the World
Les Classiques des sciences sociales
Missouri Botanical Garden
Index Fungorum
Marxists Internet Archive
Mouse Genome Informatics
Moviepilot
National Center for Education Statistics
online public access catalog
Geheugen van Nederland
WordReference.com
Personenstandsregister
Filmweb
Anefo
Stationers' Register
Reactome
Web of Knowledge
Cambridge Structural Database
SABIO-Reaction Kinetics Database
Victorian Heritage Register
Nederlands Soortenregister
Simple Knowledge Organization System
Social Security Death Index
Tree of Life Web Project
ChemSpider
Lib.ru
Swissbib
Systems Biology Ontology
KinoPoisk
Tenders Electronic Daily
Animal Sound Archive
Topten
Semantically-Interlinked Online Communities
Transfermarkt
Digital Library for Dutch Literature
UNILEX
Union List of Artist Names
AntWeb
Virtual Laboratory
Virtual Library Eastern Europe
Jewish Virtual Library
Greenstone
MycoBank
Tropicos
SUDOC
Swiss-Prot
FilmAffinity
Ameco
Sherdog
World of Spectrum
IntEnz
CiteSeerX
Fauna Europaea
academia.edu
ACME Newspictures
AlgaeBase
Alsatica
Amphibian Species of the World
Animal Diversity Web
BALaT
BASOL
European Cultivated Potato Database
Anatomography
Education Resources Information Center
British Trust for Ornithology
Digital Public Library of America
Common Locale Data Repository
Compendex
Dialnet
Orotariko Euskal Hiztegia
E-corpus
GeneReviews
EDGAR
Equasis
FamilySearch
FNAEG
Flora of Australia
FranceTerme
FRANCIS
GISAID
Global Invasive Species Database
HathiTrust
Hyper Articles en Ligne
Inspec
Center for Biological Diversity
NYPL Digital Gallery
National Science Digital Library
Open Food Facts
Powder Diffraction File
Proteopedia
Genius
Reverso
Digital Library of Slovenia
Canadian Register of Historic Places
Saccharomyces Genome Database
Schema.org
Dictionary of the Scots Language
International Nuclear Information System
SIRENE
Tela Botanica
Terrorist Identities Datamart Environment
The Arabidopsis Information Resource
National Digital Library of India
Harvard University Center for Italian Renaissance Studies
ČSFD
World Spider Catalog
WormBase
Rfam
AnimeClick
SciFinder
Library of Congress Online Catalog
XVideos
Grand Comics Database
IntraText
Liber Liber
Sardegna Digital Library
ODIS
JPL Small-Body Database
Kepler Input Catalog
MyAnimeList
Portuguese Web Archive
magazines.russ.ru
GOLD
Multitran
Russian National Corpus
ontology alignment
Speech corpus
Mushroom Observer
Dictionary of Algorithms and Data Structures
CDDB
Fundamental electronic library
CRIStin
BIBSYS
Filmweb
AGRICOLA
ATLA Religion Database
Adverse Event Reporting System
AgBase
Allele frequency net database
America's Most Endangered Historic Places
American Birding Association
American National Corpus
AmoebaDB
Analytical Sciences Digital Library
Anemi, Digital Library of Modern Greek Studies
AnimalTFDB
Animal Genome Size Database
Aquatic Commons
Archaeology Data Service
Arnetminer
Oxford Dictionaries
AusStage
AustLit: The Australian Literature Resource
Automated Similarity Judgment Program
Aviation Safety Reporting System
BIOSIS Previews
BISC
BabelNet
Baen Free Library
EPPO code
Islamonline.net
Biblioteca Virtual Miguel de Cervantes
MAINWAY
BindingDB
Bio2RDF
BioGRID
BioModels Database
BioOne
Biographical Directory of Federal Judges
BirdLife Australia
BitterDB
Bookshare
British Humanities Index
Brix
BugGuide
CAB Direct
CATH Protein Structure Classification database
CINAHL
COSMIC cancer database
California Digital Library
California Ethnic and Multicultural Archives
California Native Plant Society
California Register of Historical Resources
Chinese Text Project
National Commission for the Knowledge and Use of Biodiversity
Contemporary Authors
Copac
Core Historical Literature of Agriculture
Corpus of Contemporary American English
Corpus of Electronic Texts
Crossref
Cuneiform Digital Library Initiative
Cylinder Audio Archive
DAD-IS
DOAP
DPVweb
Database of Interacting Proteins
Death Master File
Department of Defense Serum Repository
Dietary Supplements
DigitalNZ
Digital Comic Museum
Digital Himalaya
Digital Library of Georgia
DisProt
Disease Ontology
Domínio Público
DroID
Drug Industry Document Archive
eBird
Embase
East London Theatre Archive
EcoCyc
eggNOG
Eighteenth Century Collections Online
Ekşi Sözlük
English-Arabic Parallel Corpus of United Nations Texts
Ensembl Genomes
Europarl corpus
Europe PubMed Central
European Nucleotide Archive
Exoplanet Archive
FADO
FRBRoo
Filmow
FloraBase
Flora of China
Fusarium graminearum genome database
Global Administrative Areas
GEISA
GSHHG
GWASdb
Gazetteer of Australia
GeneCards
Gene Wiki
Generic Model Organism Database
Genetic codes
GeoRef
PsycINFO
GALILEO
global microbial identifier
H-Invitational
HIV Drug Resistance Database
AQL
Hazardous Substances Data Bank
Historypin
Hispana
Human Metabolome Database
Human Protein Reference Database
IEEE Xplore
IGRhCellID
INSPIRE-HEP
Index Copernicus
information schema
Intercontinental Dictionary Series
International Protein Index
Interstate Identification Index
Invasive Species Compendium
Iraqi Virtual Science Library
IsoBase
RENAP
ChEMBL
Tebeosfera
KUPS
Kujawsko-Pomorska Digital Library
LIDB
LLMDB
Lancaster-Oslo-Bergen Corpus
Latindex
Lattes Platform
List of Prokaryotic names with Standing in Nomenclature
Latin American and Caribbean Center on Health Sciences Information
Lyon-Meudon Extragalactic Database
MAREC
MICAD
MICdb
MISLE
Making of America
Mapper(2)
Mapping the Practice and Profession of Sculpture in Britain and Ireland 1851–1951
Melvyl
MetroLyrics
miRBase
MiRTarBase
MimoDB
ModBase
Mouse Phenome Database
Mouse gene expression database
Munk's Roll
MuseData
MusicDNA
Musixmatch
NAPP
NCBI Epigenomics
NGSmethDB
NIAID ChemDB
NaPTAN
National Biodiversity Network
National Bridge Inventory
National Corpus of Polish
National Driver Register
National Elevation Dataset
National Software Reference Library
Natural Earth
neXtProt
New Advent
New Zealand Electronic Text Centre
Nolot
Norsk Ordbok
OER Commons
Online Books Page
OpenCorporates
Ordnett
OriDB
OrthoDB
Orthologous MAtrix
Oxford English Corpus
P2CS
PATRIC
PCRPi-DB
PDBsum
PHI-base
Panjab Digital Library
Pathway Commons
Penn World Table
Pennsylvania Sumerian Dictionary
PhilPapers
Phosida
Phospho3D
PhylomeDB
Planetary Data System
PPDB
Plant ontology
Plazi
Post-Reformation Digital Library
ProGlycProt
ProRepeat
Project Vote Smart
Protein circular dichroism data bank
Proteomics Identifications Database
Pseudogene
PubMed Central Canada
Publication of Archival, Library & Museum Materials
Quilt Index
REPAIRtoire
RNA-binding protein database
Registry of Open Access Repositories Mandates and Policies
REBASE
Redalyc
Regional Planetary Image Facility
Register of the National Estate
RxNorm
SAGE KE
SciELO
Screenonline
SeaLifeBase
SedDB
Sequence Ontology
Shiron.net
Sibley Music Library
Social Science Research Network
Spike
Atlas
StarBase
Synthetic gene database
TRICS
Tanums store rettskrivningsordbok
Taxatio Ecclesiastica
Technical Report Archive & Image Library
textfiles.com
Small Molecule Pathway Database
TopFIND
Toxin and Toxin-Target Database
UK Biobank
Uberon
VINITI Database RAS
VIOLIN
Viral Bioinformatics Resource Center
Virginia Landmarks Register
VoiD
WhoSampled
WikiPathways
Wikifonia
Women Writers Project
World Checklist of Selected Plant Families
World Guide to Covered Bridges
Xeno-canto
YAGO
ZINC database
ZooBank
ZoomInfo
Catalog of Fishes
e-teatr.pl
Glosbe
China Academic Library and Information System
AdoroCinema
Corpus Documentale Latinum Gallaeciae
Guia dos Quadrinhos
Internet Movie Script Database
AISLP
PlaymakerStats.com
DisGeNET
Spreadthesign
Svensk mediedatabas
CiNii
Crunchbase
Event Log
Mtime
J-STAGE
MRDB
NACSIS-CAT
Weblio
Czech Terminology Database of Library and Information Science
Automatic Fingerprint Identification System
Polona
Index Herbariorum
National Police Information System
Malopolska Digital Library
Narodowe Archiwum Cyfrowe
NUKAT
Documenta Catholica Omnia
Finnish Historical Newspaper Library
Terhikki
Finnish Social Science Data Archive
Naturbase
The Norwegian Patient Registry
Oncolex
Register of inhabitants
Kramerius
Visitors Location Register
BioCyc database collection
Euskalterm
Inguma
Liburuklik
Pomet
Scope
Statistikbanken
BioLib
Reta Vortaro
HebrewBooks
RAMBI
National Digital Science Library
Hungarian Periodicals Table of Contents Database
Corpus Scriptorum Historiae Byzantinae
New English-Irish Dictionary
DINOloket
Library of Congress Authorities
Manuscriptorium
Czech National Corpus
China Biographical Database
LithoLex
Tilastopaja
Muséofile
Taxonomy database of the U.S. National Center for Biotechnology Information
Hollandse Hoogte
FoundationDB
The National Map
Canadian Geographical Names Data Base
LncRNAdb
MinDat
Handbook of Mineralogy
rruff
webmineral.com
Sycomore
World Database of Happiness
Wikilivres
ViralZone
filmportal.de
Cochrane Database of Systematic Reviews
Database for Spoken German
GESTIS database
register of objects of cultural heritage
Ishim
Free Music Archive
The LiederNet Archive
Experimental Factor Ontology
BNAber
Netpath
Basisregistratie Personen
Buruxkak
Galiciana
xHamster
Crossroads Bank for Enterprises
Visual Novel Database
CyberLeninka
Helsinki Annotated Corpus
Ghana Club 100
Statutory List of Buildings of Special Architectural or Historic Interest
Data Catalog Vocabulary
National Automated Fingerprint Identification System
Matter of England
Barcode of Life Data Systems
Digital Image Archive of Medieval Music
COMBINE
National Biomedical Imaging Archive
iNaturalist
Frances G. Spencer Collection of American Sheet Music
Welsh Newspapers Online
Dictionary of Scottish Architects
Lester S. Levy Collection of Sheet Music
National Technical Reports Library
IMVDb
Queensland Heritage Register
Australian Organ Donor Register
Figshare
FlyExpress
Avicenna Directories
Poetry Archive
Analysis & Policy Observatory
Human Phenotype Ontology
BIBFRAME
MAQAM
Influenza Research Database
MNIST database
The Numbers
PROSESS
RefDB
International Tree-Ring Data Bank
CollecTF
Y Chromosome Haplotype Reference Database
IUPHAR/BPS Guide to PHARMACOLOGY
Library of Congress Linked Data Service
Anopress
VetBact
National Vital Statistics System
PREDITOR
Medical Heritage Library
Glottolog
Japan Center for Asian Historical Records
AUSTLANG
COMPLUDOC
hymnary.org
Global Terrorism Database
GDELT Project
SILVA ribosomal RNA database
PharmGKB
Volume Area Dihedral Angle Reporter
BioStor
National electronic library
Trove
OpenStreetMap Wiki
Open Science Framework
Finnish Population Information System
Doria
ArtFacts.Net
Finna
Digital Repository of Scientific Institutes
Calflora
Postcode data
National Pipe Organ Register
Aviation Safety Network accident description
AFL Tables
The Internet Hockey Database
Database of Vascular Plants of Canada
database of Genotypes and Phenotypes
GRIN Taxonomy for Plants
Driver Database
Center for Turkish Cinema Studies
Normannia
Archeological Information System
GrassBase
Bach Digital
Genealogics
Australian Bibliographic Network
BnF authorities
NIOSH pocket guide to chemical hazards
ClinVar
Internet Game Database
Bangumi
Gymnosperm Database
Revistes Catalanes amb Accés Obert
Dyntaxa
CERN Document Server
NCBI Gene
Shipyards
Medeltidens bildvärld
Fartyg
ACToR database
Avibase
swMATH
The Movie Database
KNApSAcK
LIPID MAPS
NDF-RT
Digital Archaeological Archive of Comparative Slavery
CLOCKSS
The Peerage
The Academic Family Tree
Database of Dutch first names by Meertens institute
FANTOM
Social Security Applications and Claims Index
Re-Member
Directory of Open Access Books
Qatar Digital Library
Austrian Parliament personal database
AGORHA
Library Genesis
Early Canadiana Online
Trismegistos
Renaper
Ontology Lookup Service
Basketball-Reference.com
Phenocarta
Zenodo
WomenWriters
PhDTree
Archives Service Center
OpenFDA
Semantic Scholar
Theoi Project
Knowledge Web
European Reference Index for the Humanities
TrEMBL
KuLaDig
Manioc
Croatian Scientific Bibliography
Fondazione Federico Zeri
UCI ChemDB
Panama Papers
Catalog of the German National Library
SciCrunch
OpenNeuro
New Zealand Organisms Register
Nutrient Tables for use in Australia
Index Hepaticarum
3DMet
British Nursing Index
Loop
Australasian Pollen and Spore Atlas
Gazetteer of Planetary Nomenclature
MIAR
MassBank
XMetDB
SureChEMBL
UniChem
NMRShiftDB
MetaboLights
eNanoMapper
NCBI Nucleotide
GreeNC
PATRIC
SuperCYP
BARCdb
NCBI Protein
caNanoLab
Nanomaterial Registry
BiGG Models
STITCH
diXa Data Warehouse
CrocBITE
MetaboAnalyst
RettBASE
euL1db
MethBank
General Internet Corpus of Russian
Jisho
Cellosaurus
ECARTICO
BioCarta
CeCaFDB
Library of Apicomplexan Metabolic Pathways
Metabolomics Workbench
PeroxisomeDB
JASPAR
FunCat
ZINC15
Adlr.link
Maria Austria Instituut
White Rose Research Online
ImageNet
Corpus Corporum
US Census Bureau International Data Base
PomBase
BacDive
Irama Nusantara
DIGAR
British Book Trade Index
Teuchos
Common Core of Data
Standards for Networking Ancient Prosopographies: Data and Relations in Greco-Roman Names
Spenserians
Lord Byron and his Times
Cranach Digital Archive
MitoAge
Legacies of British Slave-ownership
Catalogue of Life in Taiwan
CosIng database
Global Plants
OpenCitations Corpus
WikiGenomes
ENZYME
Mapillary database
MUSEFREM
NIOSHTIC-2
National Inventory of Dams
TAXREF
China National GeneBank
CompTox Chemistry Dashboard
open data portal
ratings.fide.com
South African Natural Compounds Database
Open PHACTS Discovery Platform
Gramene
Kunst im Stadtraum
CEUR Workshop Proceedings
earthquake.usgs.gov
Open Spectral Database
General Finnish Ontology
World Waterfall Database
CIViC database
Archive of Digital Art
DrugCentral
GRID
IntAct protein interaction database
kPath
UNdata
Ensembl Plants
Klosterdatenbank
OpenTrials
CycleBase
Library History Database
California Death Index
UniProt-GOA
The File Room
Mix'n'match
Panarctic Flora
Developmental FunctionaL Annotation at Tufts
LifeDB
Proteome Inc.
Syscilia
Congress.gov
Vagalume
Microsoft Academic
Evidence & Conclusion Ontology
Biblioteca de la UOC
Monuments database
Hockey-Reference.com
PROSPERO
European Genome-phenome Archive
Euro+Med Plantbase
MathSciNet
Global Ingredient Archival System
VICNAMES
Species-ID
MinIO
Harran Census
AuthorClaim
Zebrafish Model Organism Database
Open TG-GATEs
Devri
Stanford Natural Language Inference corpus
Finnish MP database
SNAC
Catalogus Philologorum Classicorum
GeoNames ontology
Critical Assessment of Protein Function Annotation
Open Metadata Registry
Bolin Centre for Climate Research
Infrafrontier
COCO
Egyptian Knowledge Bank
PsyArXiv
The Himalayan Database
Overnia
lobid-organisations
lobid-resources
Environment Ontology
WordSim-353
Dixi
Athenaeum
Google analogy test set
Inventories of American Painting and Sculpture
Stanford Question Answering Dataset
CuratedTREC
WebQuestions
WikiMovies
RFC Editor Repository
Catalogue of Illuminated Manuscripts
Canadian Civil Aircraft Register
The World's Airlines: Past, Present & Future
Sefaria
WikiPapers
The Monarch Initiative
Gary Kessler's File Signature Table
Movie Review Data
Australian Women's Register
Customer Review Datasets
Amazon product data
Stanford Sentiment Treebank
Large Movie Review Dataset
MPQA Opinion Corpus
Nominis
Ulysses database
MyVariant.info
FB15K
Brent corpus
NPS Internet Chatroom Conversations, Release 1.0
E-Theses Online Service
SecondHandSongs
SwissLipids
BioAssay Ontology
Extracellular RNA Atlas
Leipzig Corpora Collection
SocArxiv
Registry of Births, Deaths and Marriages Victoria
Queensland Registry of Births, Deaths & Marriages
Paradise Papers
ACM Digital Library
The Camelot Project
Semantic Wiki Vocabulary and Terminology
Botanico Periodicum Huntianum
ClinGen Allele Registry
Getty Iconographic Authority
WikiQA
Archives de littérature du Moyen Âge
SimLex-999
Rubenstein-Goodenough dataset
LinkedGeoData
GeoLinkedData
Common Voice
Early Modern Letters Online
Semantic Publishing and Referencing Ontologies
FRBR-aligned Bibliographic Ontology
Citation Typing Ontology
Bibliographic Reference Ontology
Skyscraper Center
CIFAR-10
CIFAR-100
Samla
SHERPA/Juliet
E-Periodica
CBCL Face Database
Commonwealth War Graves Commission database
Fine Arts Heritage Register
Altered States Database
20 Newsgroups data set
Cinema Treasures
BridgeReports.com
UMBC corpus
Ontobee
Inventory of evaluations performed by the Joint Meeting on Pesticide Residues
Ecocrop
Complex Portal
AmphibiaWeb
Valka
North Carolina Violent Death Reporting System
imSitu
Digital Repository of Ireland
Det Norske Akademis ordbok
Directory of Open Access Scholarly Resources
California Data Exchange Center
Mycology Collections data Portal
Genetic and Rare Diseases Information Center
Clojars
Plants of the World Online
UK Medical Heritage Library
donor register
Green's Dictionary of Slang
Biological Magnetic Resonance Data Bank
GEPRIS
Dutch War Memorial Database
Album of the Year
Central Library of National Technical University of Athens
MuIS
DNAtraffic
New York Public Library Digital Collections
Architectuurgids
datos.bne.es
50 Salads dataset
YouTube-8M
Parliamentary Information System
4TU.Centre for Research Data
Missouri Cancer Registry and Research Center
Nature Index
Royal College of Surgeons biographical database
govinfo
Interim Register of Marine and Nonmarine Genera
Sistema de Información Cultural
BEIC Digital Library
Merriam-Webster online dictionary
SofaScore
TFRRS
TESEO
WikiSQL
Food-10k
OER World Map
History of Geology and Mining
B.R.A.H.M.S.
DASH Repository (Harvard University)
Sol Genomics Network
Media Art Database
LC-QuAD
Kielitoimiston sanakirja
HOLLIS
Silent Era
Collective Biographies of Women
Bgee
Operone
Harvard Dataverse
ACGT Master Ontology
Adverse Event Reporting Ontology
Apollo Structured Vocabulary
Bacterial Clinical Infectious Diseases Ontology
Behavior Perspective Model
Battle Management Ontology
BioAssay Ontology
Biological Collections Ontology
Biomedical Ethics Ontology
Biomedical Grid Terminology
Blood Ontology
Bone Dysplasia Ontology
Cancer Cell Ontology
Cancer Chemoprevention Ontology
Cardiovascular Disease Ontology
Cell Behavior Ontology
Cell Culture Ontology
Cell Expression; Localization; Development and Anatomy Ontology
Cell Line Ontology
Cell Ontology
Cellular Microscopy Phenotype Ontology
Chemical Entities of Biological Interest
Chemical Information Ontology
Chemical Methods Ontology
Clusters of Orthologous Groups Analysis Ontology
Cognitive Paradigm Ontology
Common Anatomy Reference Ontology
Common Core Ontologies
Communication Standards Ontology
Comparative Data Analysis Ontology
Computational Neuroscience Ontology
Computer-Based Patient Record Ontology
Conceptual Model Ontology
Coriell Cell Line Ontology
Drug Interaction Ontology
Drug Ontology
Evidence & Conclusion Ontology
Drug-drug Interaction Ontology
Emotion Ontology
Epidemiology Ontology
Epilepsy and Seizure Ontology
Evolution Ontology
EXperimental ACTions Biomedical Protocol Ontology
Fission Yeast Phenotype Ontology
Food Ontology
Gene Regulation Ontology
General Information Model
Genomic Epidemiology Ontology
Health Data Ontology Trunk
Hemocomponents and Hemoderivatives Ontology
Host Pathogen Interactions Ontology
Human Interaction Network Ontology
Human Physiology Simulation Ontology
Infectious Disease Ontology
Information Artifact Ontology
Informed Consent Ontology
Interaction Network Ontology
Interdisciplinary Prostate Ontology Project
Knowledge Base Of Biomedicine
Lipid Ontology
Malaria Ontology
Materials Ontology
Mental Disease Ontology
Mental Functioning Ontology
Minimum Information Model for Patient Safety
Middle Layer Ontology for Clinical Care
Military Scenario Ontology
MIRO and IRbase: IT Tools for the Epidemiological Monitoring of Insecticide Resistance in Mosquito Disease Vectors
Model for Clinical Information
Mouse Pathology Ontology
Name Reaction Ontology
Nanoparticle Ontology
NeuroPsychological Testing Ontology
Neuroscience Information Framework Standard Ontology
Neural Electromagnetic Ontologies
New Upper Level Ontology
Non-Coding RNA Ontology
Ontologized Minimum Information About BIobank data Sharing
Ontology for Biobanking
Ontology for Biomedical Investigations
Ontology for Dengue Fever
Ontology for Drug Discovery Investigations
Ontology for Energy Investigations
Ontology for General Medical Science
Ontology for Genetic Interval
Ontology for Laparoscopic Surgeries
Ontology for MIcroRNA Target Prediction
Ontology for Newborn Screening and Translational Research
Ontology for Pain and Related Disability; Mental Health and Quality of Life
Ontology for Periodontitis
Ontology of Clinical Research
Ontology of Biobanking Administration
Ontology of Biological and Clinical Statistics
Ontology of Data Mining
Ontology of Datatypes
Ontology of Experimental Variables and Values
Ontology of Medically Related Social Entities
Ontology of Vaccine Adverse Events
Ontology-Based Data Access
Oral Health and Disease Ontology
Parasite Experiment Ontology
Patient Safety Categorial Structure
Phenotypic Quality Ontology
Plant Ontology
Population and Community Ontology
Population Health Record
Porifera Ontology
Proteomics data and process provenance ontology
Protein Ontology
Quality of Service Ontology
RNA Ontology
Role Ontology
Saliva Ontology
Schistosomiasis Process Ontology
Scientific Evidence and Provenance Information Ontology
Semanticscience Integrated Ontology
Situation Awareness Ontology
Sleep Domain Ontology
Software Ontology
Statistics Ontology
Subcellular Anatomy Ontology of Suggested Ontology for Pharmacogenomics
Surface Water Ontology
Time Event Ontology
Translational Medicine Ontology
Tumour-Node-Metastasis Ontology
Microbial Typing Ontology
Universal Core Semantic Layer
Vaccination Informed Consent Ontology
Vaccine Ontology
Vital Sign Ontology
Xenopus Anatomy Ontology
Zebrafish Anatomical Ontology
ZOBODAT
Signatures of Majorana fermions in hybrid superconductor-semiconductor nanowire devices
YP130
SemEval 2012 Task 2 dataset
Polish language corpus
Mol-Instincts
Trademark Electronic Search System
VOCEDplus
National Road Data Bank
DigitalCommons@UMaine
annuaire prosopographique: la France savante
RailLexic
FRBR-aligned Bibliographic Ontology
RCSB protein data bank
PDBe
Number World
Satyricon
Automated Weather Data Network
ChemInform
Open Data Web
EM-DAT
SHARE Catalogue
Morphbank
Ġabra
PAULING FILE
Soybase
Legume Information System
PeanutBase
MaizeGDB
PlantGDB
Guardiana
Small Bodies Node
National Coronial Information System
TriviaQA
Digitale Bibliothek Braunschweig
EdShare
NC DOCKS
Diposit Digital de Documents de la UAB
Digital Commons@Wayne State University
FreiDok
SoilGrids
10,000 Immunomes
Statistics on income and living conditions
The Federal Register of Legislation
FEIS
Systematic Catalog of Culicidae
Allen Coral Atlas
Map of Life
Mitochondrial Disease Database
Michigan Flora
SpringerLink
National Identity Register
Online catalog
Jordan Antiquities Database and Information System
Missing and Murdered Indigenous Women and Girls
eFloraSA
Conservation and Art Materials Encyclopedia Online
U.S. Geologic Names Lexicon
National Geologic Map Database
A Space of Their Own
Manufacturer and User Facility Device Experience
4TU.Centre for Research Data (4TU.ResearchData)
Data Series
International Pharmaceutical Abstracts
IndexCat
Index-Catalogue of the Library of the Surgeon-General’s Office
Applied Science & Technology Index
Kepler Finance
European Criminal Records Information System
Biblioteca Virtual de Defensa
Butterflies of India
Moths of India
Odonata of India
Reptiles of India
Birds of India
Moths of North America
Palynological Database
Plantarium
GONIAT
Catalogue of the Lepidoptera of Belgium
Leeds Robotic Commands
C. V. Starr Virtual Herbarium
Networked Digital Library of Theses and Dissertations
Mineral Resource Data System
International Fossil Plant Names Index
Feminae: Medieval Women and Gender Index
Epistolae: Medieval Women's Letters
Litchfield Ledger
PseudoCAP
SynGO
YuBioLab
Psyl'list
LibriSpeech
TED-LIUM corpus
45worlds
CREMA-D
Gateway to Research
Orlando
Measurement Units Ontology
Extensible Observation Ontology
Library for Quantity Kinds and Units
Semantic Web for Earth and Environmental Terminology
Microsoft Academic Knowledge Graph
BroadwayWorld
Logeion
Collection #1
Fleuron
CERL Thesaurus
Hall of Light
Something About the Author
Global Species
ScaleNet
Orphan Works Database
Lucerna
MuseumFinland
Semantic Kalevala
BookSampo
Index to American Botanical Literature
Microsoft Academic Graph
JRC Names
Time Ontology in OWL
Corpus of Linguistic Acceptability
Japan Search
Comédie-Française Registers Project
Genetics Home Reference
Arabic Ontology
SIUSA
The Bhagavad-Gita
Archaeology Data Service library
Profiles in Science
Register of Antarctic Marine Species
Kalos
FB15K-237
Pinakes
Culture Collections Information Worldwide
KBpedia
Classify
digilibLT
PHI Latin Texts
National Population Register
Levidata
Genomics England PanelApp
CellMarker
myschool
Australian Stratigraphic Units Database
Bionomia
MVDBase
Six Degrees of Francis Bacon
Gambay
International Labour Organization statistics database
Libraries.org
Find & Connect
eFloraSA
Musisque Deoque
PathoPhenoDB
Ukrainica
NSW Beach Profile Database
electrocd
SciGraph
AusPat
Australian Food Composition Database
Lawcodes
JUSTfind
Dcine.org
Australian Marine Algal Name Index
Global Names Index
Vidwan
SemCor
AllTrails
Cross-National Socio-Economic and Religion Data, 2011
Exceptional Experience Questionnaire
General Social Survey, 1993
General Social Survey, 1994
General Social Survey, 1996
General Social Survey, 2002
General Social Survey, 2004
General Social Survey, 2006
National Survey of High School Biology Teachers
Lilly Survey of Attitudes and Social Networks
Spirit and Power: Survey of Pentecostals in Guatemala
Biomedicina Slovenica
OpenUp
Religion among Academic Scientists
Religion in Italy
Spiritual Life Study of Chinese Residents
PCORnet
Calendrier électronique des spectacles sous l'Ancien Régime et la Révolution
Endangered Archives Programme
Carolina Digital Repository
AMS Tesi di Dottorato
American Memory
American Mineralogist Crystal Structure Database
ALT Open Access Repository
Agritrop
AgEcon Search
AHERO
ACMAC
ARRT
Archivio Istituzionale
Archivio Giuliano Marini
OpenScore
FIA Results and Statistics
National Digital Library of Theses and Dissertations in Taiwan
Northernstars.ca
ISSN Portal
Contributor Role Ontology
Proff
Finnish Biodiversity Information Facility
Power Reactor Information System
Legends World
World Values Survey, 2005
World Values Survey, 2010
District of Columbia Inventory of Historic Sites
Museum of Modern Art online collection
Library Publishing Directory
TESEO
University of Chicago Photographic Archive
New Zealand Heritage List
Cinema Context
Neliti
Taiwan Cinema
Pleias
Gender Studies Resources Database
WorldCat Identities
AACT Database
The Kidney & Urinary Pathway Knowledge Base
Newsroom dataset
Sequence Database Setup: MSDB
World Flora Online
Scilit
The Digital Archaeological Record
Speech Accent Archive
Garaph
DOIBoost Dataset Dump Version 3
Comprehensive Aramaic Lexicon
PubAg
MicrobeDB
ProKinO: Protein Kinase Ontology Browser
AnAge
Vision AI
The Natural Products Atlas
LiverTox
Microworld
Unified Cyber Ontology
The Good Old Days
Mapa da Cultura
Sistema Cultura
gene2phenotype
Symptom Ontology
Map of Early Modern London
8-bits
VGMRips
Roglo
Civil registration
District Digital
Illinois Digital Heritage Hub
Indiana Memory
Kentucky Digital Library
Minnesota Digital Library
Missouri Hub
North Carolina Digital Heritage Center
PA Digital
Plains to Peaks Collective
The Portal to Texas History
South Carolina Digital Library
Floridata
Exposome-Explorer
Calaix
The Vault at Pfaff's
Bird tracking - GPS tracking of Lesser Black-backed Gulls and Herring Gulls breeding at the southern North Sea coast
HotpotQA
SearchQA
Members of the European Parliament
Open Super-large Crawled ALMAnaCH coRpus
Poeti d'Italia in lingua latina
DGA Member Directory
Natural Questions
WikiHop
SynTagRus
PanTHERIA
QALD-9
Hemeroteca Nacional Digital de Mexico
NCBI Genome
NCBI Assembly
Attic Inscriptions Online
National Record of the Historic Environment
Digital Library of South Dakota
CIRIS
Chinese Clinical Trial Registry
Decoda
COVID Tracking Project
COVID-19 Community Mobility Reports
ALCUIN
Spanish National Catalog of Hospitals
BBMRI-ERIC Directory
FactGrid
Bang!
Hot Film
CLEVR
Héloïse
Epistemonikos
bab.la
New Zealand Gazetteer
Archaeology in Greece Online
iDAI.gazetteer
kb.nl
EDBL
Bibliopolis
Common Vulnerabilities and Exposures
Bioweb Ecuador
Open Images Dataset
Places database
DataPile
Elephant Encyclopedia
Land Use Database
World Checklist of Vascular Plants
LSE Digital Library
LSE Research Online
LSE Theses Online
Linked Stage Graph
AccessAble
80 Million Tiny Images
Icarus Films
Tiaki
Polish scientist
Gender, Sex, and Sexual Orientation Ontology
CephBase
GlyGen
ZivaHub
UNESDOC
The minority health & health equity archive
USGS ScienceBase
Google People Cards
VertNet
depositar
Library Hub Discover
Datastream
Context There's a story behind every dataset and here's your opportunity to share yours. Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Apache License 2.0 Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?
Context This dataset contains data from a list of Indian stocks on the NSE. It includes a collection of well-performing stocks with all the data necessary to predict which stocks to buy, hold, or exit. Acknowledgements I work in a stock research firm. This stock data is for all Kaggle users to play and experiment with in order to learn more about stock research. Inspiration The second column, "Category", gives a list of all the stocks that a user needs to buy, hold, or exit. We challenge you to develop an algorithm to see if your result matches ours.
On Friday, August 11th, 2017, a large group of racist white nationalists carrying torches marched on the University of Virginia campus in Charlottesville, VA as an intimidation tactic against proponents of the removal of Confederate statues of Robert E. Lee. The Friday evening march was held ahead of a much larger racist white nationalist rally in the center of Charlottesville planned for Saturday, August 12th, 2017. This dataset includes 100,000 tweet ids collected using the DocNow tweet collection prototype: http://app.docnow.io/ The tweet ids can be converted back into the original tweets using the DocNow Hydrator tool, which can be downloaded from here: https://github.com/DocNow/hydrator
A consistently calibrated 40-year record of visible channel remote sensing reflectances (Rrs), based on the Advanced Very High Resolution Radiometer (AVHRR) sensor global time-series. The dataset is derived from the top of atmosphere visible channel reflectances provided by the Pathfinder Atmospheres - Extended (PATMOS-x) V5.3 Climate Data Record (CDR), atmospherically corrected and masked according to quality flags. Temporal filtering and selective masking of the Rrs product are used to highlight regions of the global ocean affected by highly reflective blooms of the coccolithophorid Emiliania huxleyi over the past four decades. Both the Rrs and coccolithophorid bloom products are supplied at monthly resolution on a 0.1 x 0.1 degree global grid. Monthly mean and monthly maximum values are supplied for each product. Requests for daily files can be made to Plymouth Marine Laboratory.
The six datasets of vineyard thermal information acquired by on-the-go thermal imaging, to predict water status. With thermal indices: East side (train and test); West side (train and test); global model with both sides (train and test). Without thermal indices: East side (train and test); West side (train and test); global model with both sides (train and test).
The RAW files in this dataset can be converted to .mzXML using Proteowizard (available at http://proteowizard.sourceforge.net) and then viewed using Skyline (available through MacCoss Lab at https://skyline.gs.washington.edu/labkey/project/home/begin.view).
Seven Crystallographic Information Files obtained from an EPSRC First Grant project studying solvent-separated magnesium organohaloaluminates relevant to rechargeable battery electrolytes. The first dataset contributed to Dalton Trans., 2016, doi: 10.1039/C6DT00531D, published online as an accepted manuscript 22/02/16
Nexus and .tre files for the single-gene analyses of the Canarina dataset
Raw data for the Ferndale Bog loss-on-ignition dataset obtained from the Neotoma Paleoecological Database.
Copyright information: Taken from "A procedure for identifying homologous alternative splicing events" http://www.biomedcentral.com/1471-2105/8/260 BMC Bioinformatics 2007;8():260-260. Published online 19 Jul 2007. PMCID: PMC1950890. In the figure we highlight these two processes with a different colour code, red for the training and blue for the testing. We followed a two-fold heterogeneous cross-validation scheme [50] in which the original dataset was split in two (training and test sets). A resampling protocol was applied to correct for class-imbalance effects [51], resulting in 100 training sets with the same proportion of correct and incorrect observations. Each training set was then utilised to train an NN. We applied the latter to the events in the test set and computed the success rate. The success rate given in the article is the average of the success rates for the 200 NN.
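The resampling step described above can be illustrated with a minimal sketch. The description does not say how the protocol of [51] balances the classes, so random undersampling of the larger class is an assumption here, and the names `balanced_training_sets` and `mean_success_rate` are hypothetical helpers, not the authors' code.

```python
import random


def balanced_training_sets(observations, n_sets=100, seed=0):
    """Draw n_sets class-balanced training sets.

    observations: list of (features, label) pairs, where label is True
    for a correct observation and False for an incorrect one.
    Assumption: balance is reached by undersampling the larger class.
    """
    rng = random.Random(seed)
    correct = [o for o in observations if o[1]]
    incorrect = [o for o in observations if not o[1]]
    k = min(len(correct), len(incorrect))  # size of the smaller class
    sets = []
    for _ in range(n_sets):
        # Sample k observations from each class without replacement.
        sample = rng.sample(correct, k) + rng.sample(incorrect, k)
        rng.shuffle(sample)
        sets.append(sample)
    return sets


def mean_success_rate(rates):
    """Average the per-network success rates, as reported in the article."""
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Toy data: 10 correct and 40 incorrect observations.
    toy = [((i,), i % 5 == 0) for i in range(50)]
    training_sets = balanced_training_sets(toy, n_sets=100, seed=1)
    print(len(training_sets), len(training_sets[0]))
```

Each resampled set would then train one network, and the per-network success rates on the held-out half would be averaged with something like `mean_success_rate`.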
Context I got this dataset from the UCI Machine Learning Repository. I am very interested in it because air quality in some big cities is a very serious problem, which I see as one of our global warming problems. UCI obtained the data from a sensor device located in Italy. You can also read about the dataset in the description below. Content I got this data from the UCI Machine Learning Repository. Here is the description of the rows and columns, along with other details. "The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year) representing the longest freely available recordings of on field deployed air quality chemical sensor devices responses. Ground Truth hourly averaged concentrations for CO, Non Metanic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) and were provided by a co-located reference certified analyzer. Evidences of cross-sensitivities as well as both concept and sensor drifts are present as described in De Vito et al., Sens. And Act. B, Vol. 129,2,2008 (citation required) eventually affecting sensors concentration estimation capabilities. Missing values are tagged with -200 value." Acknowledgements Thanks to UCI [https://archive.ics.uci.edu/ml/index.php][1] Inspiration I would like to see other methods to classify or cluster this dataset for time-series purposes. [1]: https://archive.ics.uci.edu/ml/index.php
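Because missing values are tagged with -200, they need to be masked before any averaging or clustering, otherwise they distort the statistics. The following is a minimal sketch using only the Python standard library; the helper names `mask_missing` and `mean_ignoring_missing` and the sample readings are hypothetical, not part of the dataset.

```python
MISSING = -200  # sentinel for missing values, as stated in the description


def mask_missing(values, sentinel=MISSING):
    """Replace the sentinel with None so it cannot enter later arithmetic."""
    return [None if v == sentinel else v for v in values]


def mean_ignoring_missing(values):
    """Mean of the values that are present, or None if none are present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


if __name__ == "__main__":
    # Made-up hourly CO readings with two missing hours.
    co = [2.6, -200, 2.2, -200, 1.6]
    masked = mask_missing(co)
    print(masked)
    print(mean_ignoring_missing(masked))
```

Averaging the raw list would pull the mean far below zero, which is why the mask is applied first.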
Did you ever go through your vacation photos and ask yourself: What is the name of this temple I visited in China? Who created this monument I saw in France? Landmark recognition can help! This technology can predict landmark labels directly from image pixels, to help people better understand and organize their photo collections. Today, a great obstacle to landmark recognition research is the lack of large annotated datasets. This motivated us to release Google-Landmarks, the largest worldwide dataset to date, to foster progress in this problem. The dataset is divided into two sets of images, to evaluate two different computer vision tasks: recognition and retrieval. The data was originally described in [1], and published as part of the [Google Landmark Recognition Challenge](https://www.kaggle.com/c/landmark-recognition-challenge) and [Google Landmark Retrieval Challenge](https://www.kaggle.com/c/landmark-retrieval-challenge). Additionally, to spur research in this field, we have open-sourced Deep Local Features (DELF), an attentive local feature descriptor that we believe is especially suited for this kind of task. DELF's code can be found on GitHub via [this link](https://github.com/tensorflow/models/tree/master/research/delf). If you make use of this dataset in your research, please consider citing: `H. Noh, A. Araujo, J. Sim, T. Weyand, B. Han, "Large-Scale Image Retrieval with Attentive Deep Local Features", Proc. 
ICCV\'17` Challenges The two challenges associated to this dataset can be found in the following links: * [Google Landmark Recognition Challenge](https://www.kaggle.com/c/landmark-recognition-challenge) * [Google Landmark Retrieval Challenge](https://www.kaggle.com/c/landmark-retrieval-challenge) CVPR\'18 Workshop The [Landmark Recognition Workshop](https://landmarkscvprw18.github.io) at [CVPR 2018](http://cvpr2018.thecvf.com/program/workshops) will discuss recent progress on landmark recognition and image retrieval, taking into account the results of the above-mentioned challenges. Top submissions for the challenges will be invited to give talks at the workshop. Content The dataset contains URLs of images which are publicly available online (this [Python script](https://www.kaggle.com/tobwey/landmark-recognition-challenge-image-downloader) may be useful to download the images). Note that no image data is released, only URLs. The dataset contains test images, training images and index images. The test images are used in both tasks: for the recognition task, a landmark label may be predicted for each test image; for the retrieval task, relevant index images may be retrieved for each test image. The training images are associated to landmark labels, and can be used to train models for the recognition and retrieval challenges (for a visualization of the geographic distribution of training images, see [2]). The index images are used in the retrieval task, composing the set from which images should be retrieved. Note that the test set for both the recognition and retrieval tasks is the same, to encourage researchers to experiment with both. We also encourage participants to use the training data from the recognition task to train models which could be useful for the retrieval task. Note, however, that there are no landmarks in common between the training/index sets of the two tasks. 
The images listed in the dataset are not directly in our control, so their availability may change over time, and the dataset files may be updated to remove URLs which no longer work. Dataset construction The training and index sets were constructed by clustering photos with respect to their geolocation and visual similarity using an algorithm similar to the one described in [3]. Matches between training images were established using local feature matching. Note that there may be multiple clusters per landmark, which typically correspond to different views or different parts of the landmark. To avoid bias, no computer vision algorithms were used for ground truth generation. Instead, we established ground truth correspondences between test images and landmarks using human annotators. License The images listed in this dataset are publicly available on the web, and may have different licenses. Google does not own their copyright. References [1] H. Noh, A. Araujo, J. Sim, T. Weyand, B. Han, "Large-Scale Image Retrieval with Attentive Deep Local Features", Proc. ICCV'17 [2] A. Araujo, T. Weyand, "Google-Landmarks: A New Dataset and Challenge for Landmark Recognition", Google Research blog post, available online [here](https://research.googleblog.com/2018/03/google-landmarks-new-dataset-and.html) [3] Y.-T. Zheng, M. Zhao, Y. Song, H. Adam, U. Buddemeier, A. Bissacco, F. Brucher, T.-S. Chua, H. Neven, "Tour the World: Building a Web-Scale Landmark Recognition Engine", Proc. CVPR'09
Context The National Park Service publishes a database of animal and plant species identified in individual national parks and verified by evidence — observations, vouchers, or reports that document the presence of a species in a park. All park species records are available to the public on the National Park Species portal; exceptions are made for sensitive, threatened, or endangered species when widespread distribution of information could pose a risk to the species in the park. Content National Park species lists provide information on the presence and status of species in our national parks. These species lists are works in progress and the absence of a species from a list does not necessarily mean the species is absent from a park. The time and effort spent on species inventories varies from park to park, which may result in data gaps. Species taxonomy changes over time and reflects regional variations or preferences; therefore, records may be listed under a different species name. Each park species record includes a species ID, park name, taxonomic information, scientific name, one or more common names, record status, occurrence (verification of species presence in park), nativeness (species native or foreign to park), abundance (presence and visibility of species in park), seasonality (season and nature of presence in park), and conservation status (species classification according to US Fish & Wildlife Service). Taxonomic classes have been translated from Latin to English for species categorization; order, family, and scientific name (genus, species, subspecies) are in Latin. Acknowledgements The National Park Service species list database is managed and updated by staff at individual national parks and the systemwide Inventory and Monitoring department. Source: https://irma.nps.gov/NPSpecies Users interested in getting this data via web services, please go to: http://irmaservices.nps.gov
Source Gathered from the New York Times API for Hardcover Fiction best sellers from June 7, 2008 to July 22, 2018. The API can be found here: [https://developer.nytimes.com/][1] Collected data includes the book title, author, the date of the best seller list, the published date of the list, the book description, the rank (this week and last week), the publisher, number of weeks on the list, and the price. [1]: https://developer.nytimes.com/
This dataset has been uploaded from VDiscover on [Github](https://github.com/CIFASIS/VDiscover)
This geodatabase was built to cover several geothermal targets developed by Flint Geothermal in 2012 during a search for high-temperature systems that could be exploited for electric power development. Several of the thermal springs and wells in the Routt Hot Spring and Steamboat Springs area have geochemistry and geothermometry values indicative of high-temperature systems. Datasets include: 1. Results of reconnaissance shallow (2 meter) temperature surveys 2. Air photo lineaments 3. Groundwater geochemistry 5. Georeferenced geologic map of Routt County 6. Various 1:24,000 scale topographic maps
Context Dataset available at https://datamarket.com/data/set/22u3/international-airline-passengers-monthly-totals-in-thousands-jan-49-dec-60!ds=22u3&display=line
The CLAAS-2 record provides cloud properties derived from the SEVIRI sensor onboard METEOSAT second generation (MSG) satellites. This second edition is the improved and extended follow-up of the first version of the record (Stengel et al., 2014; CLAAS-1 DOI:10.5676/EUM_SAF_CM/CLAAS/V001). In order to ensure a homogeneous data basis, the solar SEVIRI channels of MSG-1, MSG-2 and MSG-3 were intercalibrated (Meirink et al., 2013) with MODIS Aqua before applying the cloud retrievals. CLAAS-2 features 12 years (2004-2015) of cloud mask/type, cloud top temperature/pressure/height, cloud phase as well as cloud microphysical properties such as optical thickness, effective droplet radius and cloud water path. The data are available at native SEVIRI resolution, i.e. a 15-minute repeat cycle and 3 km (nadir) to 11 km (edge of the field of view) spatial resolution. In addition, spatio-temporal averages of the above-mentioned cloud properties are included: daily and monthly averages and monthly histograms on a 0.05° x 0.05° grid as well as monthly mean diurnal cycles on a 0.25° x 0.25° grid. The advancements compared to CLAAS-1 (DOI:10.5676/EUM_SAF_CM/CLAAS/V001) are, for example: (1) an extended MSG measurement record with better calibration, (2) improvements to the retrieval algorithm leading to higher-quality products and (3) increased temporal resolution (15 minutes). A summary of the CLAAS-2 characteristics and a comprehensive evaluation of the results are currently documented in Benas et al. (2016). Along with the data, comprehensive documentation including a user guide, algorithm descriptions, reprocessing layout and extensive validation studies is provided. With CLAAS-2, regional and large-scale cloud processes at temporal scales of minutes to years can be studied. SEVIRI-based surface radiation products, which were part of CLAAS-1, are now released in a separate dataset (SARAH-2).
The zip file contains 12338 datasets for outlier detection investigated in the following papers: (1) Instance space analysis for unsupervised outlier detection. Authors: Sevvandi Kandanaarachchi, Mario A. Munoz, Kate Smith-Miles. (2) On normalization and algorithm selection for unsupervised outlier detection. Authors: Sevvandi Kandanaarachchi, Mario A. Munoz, Rob J. Hyndman, Kate Smith-Miles. Some of these datasets were originally discussed in the paper: On the evaluation of unsupervised outlier detection: measures, datasets and an empirical study. Authors: G. O. Campos, A. Zimek, J. Sander, R. J.G.B. Campello, B. Micenkova, E. Schubert, I. Assent, M.E. Houle.
Every replicate (individually or combined) in the three datasets H3K4me3, H3K27me3 and H3K36me3, and in dataset H3K27ac, has a fixed number of cell types. H3K4me3 has two replicates: 1 and 2. H3K27 holds replicate 1 and replicates 1 and 2 together. Dataset H3K36 has only replicate 1. For the combined analysis of H3K4me3 and H3K27me3 there is a folder named H3K4me3-H3K27me3(combined). Each dataset folder contains 4 subfolders named IQA, MLQA, ML and Overlap representation. The three folders ML, MLQA and IQA hold the results of the corresponding cell-type tree generation methods, and each contains the cell-type tree in Newick format. The estimated quartet files that we generated for the MLQA and IQA methods are given in the MLQA and IQA folders. Finally, the overlap representation data for the cell types are in the folder Overlap representation. This folder has a text file named Overlap_datarepresentaion in which the two numbers in the first row give the number of cell types and the data length. Each following row, identified by t1, t2, etc., carries the overlap data. The mapping from t1, t2, etc. to the original cell types is provided in the file_sequence text file.
The folder contains all the R script files and data used in the paper. Two datasets are trimmed .csv versions of Table S1 and S3 and the third is functional data for North American mammals from the EltonTraits 1.0 database. Wilman, H., Belmaker, J., Simpson, J., la Rosa, de, C., Rivadeneira, M. M., & Jetz, W. (2014). EltonTraits 1.0: species-level foraging attributes of the world’s birds and mammals. Ecology, 95(7), 2027. To recreate all analyses and figures used in the paper, open the script “Davis Disassembly Main Code Open First.R”, highlight all the text and click run. Make sure that all the script files and data .csv files are in the same folder and that that folder is set as your working directory.
Context I was looking for a good dataset for learning and research purposes. I always had in mind a collection which could be used for a sorting robot at a later stage. Lego bricks are good candidates. At first I experimented with photographing bricks from different angles, but this was time consuming. That is why I turned to computer rendering of the bricks using Blender. Content In this dataset you will find 16 different Lego bricks. Each brick was selected on Mecabricks.com and then imported into Blender in Collada (.dae) format. I used an animator object to render the imported brick from 400 different angles. Acknowledgements Blender is free and open 3D creation software. Mecabricks.com is a free online Lego modeling tool. Inspiration I hope you can take advantage of this simple set in your learning or research. Let me know if there is a need to expand the dataset to more bricks. Enjoy!
Datasets obtained from the single locus Markov chain model, plus R scripts for producing the figures presented within the article.
Recent dataset comprising clinical and US data
Table_S2.xlsx: Presence/Absence of genes x taxa Genes present by taxon. 1 = present; 0 = absent. Genes are given by OG name (OrthoMCL) and are marked with an * if a member of the 150 most even gene dataset and with a ^ if identified as a gene affected by EGT
Copyright information: Taken from "Genome-wide identification of functionally distinct subsets of cellular mRNAs associated with two nucleocytoplasmic-shuttling mammalian splicing factors". Genome Biology 2006;7(11):R113-R113. Published online 30 Nov 2006. PMCID: PMC1794580. Unsupervised clustering of the microarray dataset was performed with the dChip software using standard settings considering all nonredundant probes with positive hybridization signal. The dataset includes microarray hybridization results from input and immunoprecipitation (IP) samples from three experiments with anti-U2AF antibody (U1 to U3) and two experiments with anti-PTB antibody (P1 and P2). Sample clustering defines a tree with two first level branches corresponding to input and IP samples. Re-clustering analysis after clearing transcripts that were over-represented either in the inputs or in all immunoprecipitation samples. Sample clustering defines a tree with three first level branches corresponding to input, U2AF, and PTB immunoprecipitation samples. For clustering analysis, the probe signal intensities for each mRNA are standardized to have mean 0 and standard deviation 1 across all samples. The color scale for mRNAs is presented as follows: red represents expression level above mean expression of a gene across all samples, black represents mean expression; and green represents expression lower than the mean. Because of the standardization, probe signal intensities most likely fall within [-3, 3]. PTB, polypyrimidine tract binding protein; U2AF, U2 small nuclear RNP auxiliary factor.
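The standardization mentioned above (mean 0 and standard deviation 1 across all samples, per mRNA) can be illustrated with a short sketch. This is not dChip's code; the function name is hypothetical, the intensities are invented, and the use of the population standard deviation is an assumption, since the legend does not say which convention was used.

```python
import statistics


def standardize(intensities):
    """Rescale one mRNA's intensities to mean 0 and standard deviation 1.

    Assumption: the population standard deviation is used; the source
    legend does not specify the convention.
    """
    mean = statistics.mean(intensities)
    sd = statistics.pstdev(intensities)
    if sd == 0:
        # A flat profile carries no variation, so every value maps to 0.
        return [0.0 for _ in intensities]
    return [(value - mean) / sd for value in intensities]


# Hypothetical signal intensities for one mRNA across five samples.
profile = [120.0, 150.0, 90.0, 300.0, 140.0]
print([round(z, 2) for z in standardize(profile)])
```

After this rescaling, positive values lie above the gene's mean (shown red in the color scale), values near 0 sit at the mean (black), and negative values lie below it (green).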
Dataset includes 14C measurements made from soil organic matter and CO2 from paired anaerobic and aerobic laboratory soil incubations of active layer soils collected in Barrow, Alaska in 2014. In addition to 14CO2, the dataset includes CO2 production rates and carbon and nitrogen concentrations. Samples were collected from intensive study site 1 areas A, B, and C, and the site 0 and AB transects, from specified positions in high-centered, flat-centered, and low-centered polygons.
Context The USDA Plant database extraction from the Natural Resources Conservation Service. Content It contains a wide variety of varieties in raw format. Inspiration There is currently no USDA plant information available via API. Using this data set I'm hoping that I can extract needed information for improved plant growth in controlled environments.
Context GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations for words. Content Contains 4 files for 4 embedding representations. 1. glove.6B.50d.txt - 6 billion tokens and 50 features 2. glove.6B.100d.txt - 6 billion tokens and 100 features 3. glove.6B.200d.txt - 6 billion tokens and 200 features 4. glove.6B.300d.txt - 6 billion tokens and 300 features Acknowledgements https://nlp.stanford.edu/projects/glove/
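The GloVe text files store one word per line, followed by its space-separated vector components. The sketch below shows one way to parse such lines into a dictionary using only the standard library; the function names are hypothetical, and the two 3-dimensional sample lines are invented for illustration (the real files carry 50 to 300 components per word).

```python
def parse_glove_line(line):
    """Split one line into (word, vector of floats)."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(component) for component in parts[1:]]


def load_glove(lines, expected_dim=None):
    """Build a word -> vector dict from an iterable of lines.

    `expected_dim` is an optional sanity check, e.g. 50 for glove.6B.50d.txt.
    """
    embeddings = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, vector = parse_glove_line(line)
        if expected_dim is not None and len(vector) != expected_dim:
            raise ValueError(f"{word!r} has {len(vector)} components, expected {expected_dim}")
        embeddings[word] = vector
    return embeddings


# Invented sample lines in the same layout as the real files.
sample_lines = ["the 0.1 -0.2 0.3\n", "cat 0.4 0.5 -0.6\n"]
embeddings = load_glove(sample_lines, expected_dim=3)
print(embeddings["cat"])
```

To read an actual file, pass an open file object instead of the list, for example `load_glove(open("glove.6B.50d.txt", encoding="utf-8"), expected_dim=50)`.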
3 million Russian troll tweets This data was used in the FiveThirtyEight story [Why We’re Sharing 3 Million Russian Troll Tweets](https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/). This directory contains data on nearly 3 million tweets sent from Twitter handles connected to the Internet Research Agency, a Russian "troll factory" and a defendant in [an indictment](https://www.justice.gov/file/1035477/download) filed by the Justice Department in February 2018, as part of special counsel Robert Mueller's Russia investigation. The tweets in this database were sent between February 2012 and May 2018, with the vast majority posted from 2015 through 2017. FiveThirtyEight obtained the data from Clemson University researchers [Darren Linvill](https://www.clemson.edu/cbshs/faculty-staff/profiles/darrenl), an associate professor of communication, and [Patrick Warren](http://pwarren.people.clemson.edu/), an associate professor of economics, on July 25, 2018. They gathered the data using custom searches on a tool called Social Studio, owned by Salesforce and contracted for use by Clemson's [Social Media Listening Center](https://www.clemson.edu/cbshs/centers-institutes/smlc/). The basis for the Twitter handles included in this data are the [November 2017](https://democrats-intelligence.house.gov/uploadedfiles/exhibit_b.pdf) and [June 2018](https://democrats-intelligence.house.gov/uploadedfiles/ira_handles_june_2018.pdf) lists of Internet Research Agency-connected handles that Twitter [provided](https://democrats-intelligence.house.gov/news/documentsingle.aspx?DocumentID=396) to Congress. This data set contains every tweet sent from each of the 2,752 handles on the November 2017 list since May 10, 2015. For the 946 handles newly added on the June 2018 list, this data contains every tweet since June 19, 2015. (For certain handles, the data extends even earlier than these ranges. Some of the listed handles did not tweet during these ranges.) 
The researchers believe that this includes the overwhelming majority of these handles’ activity. The researchers also removed 19 handles that remained on the June 2018 list but that they deemed very unlikely to be IRA trolls. In total, the nine CSV files include 2,973,371 tweets from 2,848 Twitter handles. Also, as always, caveat emptor -- in this case, tweet-reader beware: In addition to their own content, some of the tweets contain active links, which may lead to adult content or worse. The Clemson researchers used this data in a working paper, [Troll Factories: The Internet Research Agency and State-Sponsored Agenda Building](http://pwarren.people.clemson.edu/Linvill_Warren_TrollFactory.pdf), which is currently under review at an academic journal. The authors’ analysis in this paper was done on the data file provided here, limiting the date window to June 19, 2015, to Dec. 31, 2017. The files have the following columns:

Header | Definition
---|---------
`external_author_id` | An author account ID from Twitter
`author` | The handle sending the tweet
`content` | The text of the tweet
`region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1)
`language` | The language of the tweet
`publish_date` | The date and time the tweet was sent
`harvested_date` | The date and time the tweet was collected by Social Studio
`following` | The number of accounts the handle was following at the time of the tweet
`followers` | The number of followers the handle had at the time of the tweet
`updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
`post_type` | Indicates if the tweet was a retweet or a quote-tweet
`account_type` | Specific account theme, as coded by Linvill and Warren
`retweet` | A binary indicator of whether or not the tweet is a retweet
`account_category` | General account theme, as coded by Linvill and Warren
`new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018

If you use this data and find anything interesting, please let us know. Send your projects to oliver.roeder@fivethirtyeight.com or [@ollie](https://twitter.com/ollie). The Clemson researchers wish to acknowledge the assistance of the Clemson University Social Media Listening Center and Brandon Boatwright of the University of Tennessee, Knoxville.
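As a starting point for exploring the files, the sketch below tallies tweets by the `account_category` column using only the standard library. The helper name is hypothetical, and the in-memory sample (handles, text and category labels) is invented for illustration; it uses a subset of the column names listed above and does not reproduce real rows.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count rows per distinct value of `column` in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Invented rows; category labels here are placeholders, not the real coding.
sample = io.StringIO(
    "author,content,account_category,retweet\n"
    "handle_a,first example,CategoryX,0\n"
    "handle_b,second example,CategoryY,1\n"
    "handle_a,third example,CategoryX,1\n"
)
counts = count_by_column(sample, "account_category")
print(counts.most_common())
```

With a real file, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`, where `path` points to one of the nine CSV files.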
Description of the genomic datasets (nanopore and Illumina).
(A) Global surface heat flow measurements (International Heat Flow Commission) plotted by binning the point-measurement datasets to an adaptively-refined triangulation with a minimum resolution of 50 km. Color saturation represents the resolution of the triangulation which washes out progressively such that the coarsest resolution is shown with only 5% saturation.
In order to allow full comparability with other ocean acidification data sets, the R package seacarb (Gattuso et al., 2016) was used to compute a complete and consistent set of carbonate system variables, as described by Nisumaa et al. (2010). In this dataset the original values were archived in addition to the recalculated parameters (see related PI). The date of carbonate chemistry calculation by seacarb is 2018-02-02.
This dataset contains a txt file with all words from different languages, for example English or French.
This dataset contains original and imputed data from the publication
Each column represents a vector of 3720 elements obtained by computing 20 principal components (PCA) for each RGB color matrix of the image dataset.
US Stock Intra-day Dataset
Columns 1. Name of Planet 2. Weight by World 3. Diameter (km) 4. Average Distance from Sun (km) 5. Gravity (Earth=1) 6. Time to Orbit Sun (days) 7. Time to Spin on Axis (minutes) 8. Number of Known Moons 9. Year of Discovery 10. Average Temperature (°C) 11. Contents of Atmosphere (more than 1%)
Context This dataset is a snapshot of the [OpenPowerlifting](http://www.openpowerlifting.org/index.html) database as of February 2018. OpenPowerlifting is an organization which tracks meets and competitor results in the sport of powerlifting, in which competitors compete to lift the most weight for their class in three separate weightlifting categories. Content This dataset includes two files. `meets.csv` is a record of all meets (competitions) included in the OpenPowerlifting database. `competitors.csv` is a record of all competitors who attended those meets, and the stats and lifts that they recorded at them. For more on how this dataset was collected, see the [OpenPowerlifting FAQ](http://www.openpowerlifting.org/faq.html). Acknowledgements This dataset is republished as-is from the [OpenPowerlifting source](http://www.openpowerlifting.org/data.html). Inspiration * How much influence does overall weight have on lifting capacity? * How big of a difference does gender make? What are the demographics of lifters more generally?
Context This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories. Contents - Reviews.csv: Pulled from the corresponding SQLite table named Reviews in database.sqlite - database.sqlite: Contains the table 'Reviews' Data includes: - Reviews from Oct 1999 - Oct 2012 - 568,454 reviews - 256,059 users - 74,258 products - 260 users with > 50 reviews Acknowledgements See [this SQLite query](https://www.kaggle.com/benhamner/d/snap/amazon-fine-food-reviews/data-sample) for a quick sample of the dataset. If you publish articles based on this dataset, please cite the following paper: - J. McAuley and J. Leskovec. [From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews](http://i.stanford.edu/~julian/pdfs/www13.pdf). WWW, 2013.
Utility Data The data is extracted from [geonames][1], a very exhaustive list of worldwide toponyms. **It can be joined with datasets containing geographic fields to facilitate geospatial analysis including mapping.** This [datapackage][2] only lists cities above 15,000 inhabitants. Each city is associated with its country and subcountry to reduce the number of ambiguities. Subcountry can be the name of a state (e.g., in the United Kingdom or the United States of America) or the major administrative section (e.g., "region" in France). See the `admin1` field on the [geonames website][3] for further info about subcountry. Notice that: * Some cities like Vatican City or Singapore are a whole state so they don't belong to any subcountry. Therefore subcountry is `N/A`. * There is no guarantee that a city has a unique name in a country and subcountry (at the time of writing, there are about 60 ambiguities). But for each city, the source data primary key `geonameid` is provided. Preparation You can run the script yourself to update the data and publish them to GitHub/Kaggle: see [scripts README][4] Acknowledgments and License All data is licensed under the Creative Commons Attribution License, as is the original data from [geonames][5]. This means you have to credit [geonames][6] when using the data. And while no credit is formally required, a link back or credit to [Lexman][7] and the [Open Knowledge Foundation][8] is much appreciated. *This dataset description is reproduced here from [its original source][9] with slight modifications.* [1]: http://www.geonames.org/ [2]: http://dataprotocols.org/data-packages/ [3]: http://www.geonames.org/ [4]: http://data.okfn.org/data/core/scripts/README.md [5]: http://www.geonames.org/ [6]: http://www.geonames.org/ [7]: http://github.com/lexman [8]: http://okfn.org/ [9]: http://data.okfn.org/data/core/world-cities
Here we provide four ArcGIS map packages with georeferenced files on the spatial distribution of Antarctic petrels, Adélie penguins (breeders and non-breeders) and Emperor penguins in the wider Weddell Sea (Antarctica), which were created in the context of the development of a marine protected area in the Weddell Sea. Antarctic petrel (Thalassoica antarctica): We approximated potential foraging habitats of T. antarctica according to existing literature by ice coverage from AMSR-E sea ice maps, bathymetric data from the International Bathymetric Chart of the Southern Ocean (IBCSO), and seawater temperature data from the Finite Element Sea Ice - Ocean Model (FESOM) provided by R. Timmermann (AWI). Subsequently, we combined our Antarctic petrel model with the kernel utilization distribution model from Descamps et al. (2016). The authors kindly provided us with shape files showing the kernel utilization summer and winter distribution of Antarctic petrel breeding at Svarthamaren. Breeding locations and estimated number of breeding pairs were taken from van Franeker et al. (1999). Favourable habitat conditions for Antarctic petrels were predicted for the Lazarev Sea and along the eastern coast of the Weddell Sea, particularly for the area off the Fimbul Ice Shelf and along the coast between approx. 15°E to 10°W within a water depth range from approx. 500 m to 2500 m. Breeding Adélie penguins (Pygoscelis adeliae): The map of potential foraging habitats of breeding P. adeliae is based on British Antarctic Survey (BAS) Inventory data from Phil Trathan (ID 754) and Mike Dunn and P. Trathan (ID 764, 773, 779), a dataset from BAS (P. Trathan) and Instituto Antártico Argentino (Mercedes Santos) (ID 753) and a dataset from the US AMLR Program from Jefferson Hinke and Wayne Trivelpiece (NOAA) (ID 910), which are stored in the Birdlife International's Seabird Tracking Database (data request: 20-10-2015). 
Suitable foraging habitats for breeding Adélies from colonies for which no tracking data were available were approximated by a 50 km buffer and a 50-100 km ring buffer around each colony according to the recommendations of a CCAMLR MPA planning workshop. Breeding locations and estimated abundance of breeding pairs were taken from Lynch and LaRue (2014). The tracking data were processed with a state-space model described by Johnson et al. (2008) and implemented in the R package crawl (Johnson 2011). Jefferson Hinke (NOAA) kindly provided us with support running the R script. Highly suitable foraging habitats occurred about 50 km away from the colonies on King George Island, the colony in Hope Bay (Graham Land) and the colonies on the South Orkney Islands. Non-breeding Adélie penguins (Pygoscelis adeliae): The map of potential foraging habitats of non-breeding P. adeliae is based on British Antarctic Survey (BAS) Inventory data from Phil Trathan (ID 754) and Mike Dunn and P. Trathan (ID 773, 779), a dataset from BAS (P. Trathan) and Instituto Antártico Argentino (Mercedes Santos) (ID 753) and a dataset from the US AMLR Program from Jefferson Hinke and Wayne Trivelpiece (NOAA) (ID 910), which are stored in the Birdlife International's Seabird Tracking Database (data request: 20-10-2015). The tracking data were processed with a state-space model described by Johnson et al. (2008) and implemented in the R package crawl (Johnson 2011). Jefferson Hinke (NOAA) kindly provided us with support running the R script. Highest habitat utilisation was concentrated in relatively small areas (e.g., close to King George Island). However, the non-breeding Adélies seemed to roam through large parts of the Weddell Sea. Emperor penguins (Aptenodytes forsteri): The probability map of A. forsteri occurrence was developed as a function of distance to colony and colony size from Fretwell et al. (2012, 2014) as well as from sea ice concentration from AMSR-E sea ice maps. 
Our model of emperor penguin foraging distribution during breeding season showed that the probability of occurrence is highest at the Halley and Dawson colony near Brunt Ice Shelf and at the Atka colony near Ekstrøm Ice Shelf. More information on the spatial analysis is given in working paper WG-EMM-16/03 and WG-SAM-17/30 (for T. antarctica) submitted to the CCAMLR Working Group on Ecosystem Monitoring and Management (EMM) and the CCAMLR Working Group on Statistics, Assessments and Modelling (SAM), respectively (available at https://www.ccamlr.org/en/wg-emm-16 and https://www.ccamlr.org/en/wg-sam-17).
Dataset used for correlation analyses and PLS.
These are some key metrics behind indoor and outdoor road cycling.
Context This is a modified version of the BioGRID Homo sapiens dataset. I removed some columns from the original dataset in order to make it easier to use and understand. Content This dataset contains four columns. The first two specify an interaction between two proteins (Official Symbol Interactor A and Official Symbol Interactor B). The third column contains the PMID of an article describing an experiment that gives information about the interaction. The fourth column contains information about the throughput of the interaction. You can download the original version of the dataset [here](https://thebiogrid.org/).
The dataset is made up of full-body kinematics data of 6 subjects performing two main tasks: - walking alone (Solo Trial); - walking together with another subject while mechanically coupled through a stretcher-like object (Paired Trial). Data were collected using the VICON system with 14 infrared Bonita cameras. Subjects were fully instrumented with 34 passive markers placed according to the Plug-In-Gait marker placement. The stretcher-like object used to mechanically couple a pair of subjects was also instrumented with 6 markers in order to detect its position during the trials. The experiment parameters and subject metadata are provided in the files Subject_Disposition.xlsx and Database_Subjects_CoMTrajectory.xlsx. The trajectory of each marker placed on each subject within a pair is stored in Pair*.mat, while the trajectory and the velocity of the CoM for each subject evaluated at each gait cycle can be found in CoM_trajectories and CoM_velocities. In detail: - Pair*.mat: For each pair (*=1,2,3,...,7), the cells represent the trials performed by the pair. Opening one cell (trial) reveals several fields. The relevant one is the ‘DATA’ field (hdl*{1,trial}.DATA), which holds several parameters related to the Subject in front of the stretcher (1) or behind it (2). For simplicity we refer only to the Subject in front when explaining the rest of the fields (the same holds for the Subject behind, replacing 1 with 2 or A with B). The main fields to be analyzed are: o hdl*{1,trial}.DATA.Pos1 -> contains all the markers’ trajectories, related to markers on the right (R), on the left (L) or in the center (C). o hdl*{1,trial}.DATA.CoMs01A.C.CoM -> contains the CoM trajectory. o hdl*{1,trial}.DATA.CoM_Table -> contains the Table CoM trajectory. 
- CoM_trajectories: each field is a subject identified by a number (the pair to which he/she belongs) and a letter A/B indicating whether the subject is in front of the stretcher (A) or behind it (B). o COM_GCT.Subject**.Single -> contains a number of cells corresponding to the total number of trials that Subject** performed during the ‘Solo Trial’. Since some of the 6 analyzed subjects can appear in different pairs, or in the same pair but in a different position, please refer to Subject_Disposition.xlsx to map a subject inside a pair to the real subject (whose parameters are stored in Database_Subjects_CoM_Trajectory.xlsx). Each trial gives access to the CoM trajectory along the forward direction; o COM_GCT.Subject**.Coupled -> contains a number of cells corresponding to the total number of trials that Subject** performed during the ‘Paired Trial’. Each trial gives access to the CoM trajectory along the forward direction; o COM_GCT.Subject**.Media -> the mean CoM trajectory for both Single and Paired trials. - CoM_velocities: same scheme as CoM_trajectories.
This dataset contains macaque single-unit recordings from the orbitofrontal cortices (subjects J and T) for a stop signal task and an economic choice task. The data are stored in .mat format, with separate files for neural activity aligned to the go signal and to the stop signal.
Quanyin Hu et al. Dataset for [Conjugation of haematopoietic stem cells and platelets decorated with anti-PD-1 antibodies augments anti-leukaemia efficacy].
We propose adaptive incremental mixture Markov chain Monte Carlo (AIMM), a novel approach to sample from challenging probability distributions defined on a general state-space. While adaptive MCMC methods usually update a parametric proposal kernel with a global rule, AIMM locally adapts a semiparametric kernel. AIMM is based on an independent Metropolis–Hastings proposal distribution which takes the form of a finite mixture of Gaussian distributions. Central to this approach is the idea that the proposal distribution adapts to the target by locally adding a mixture component when the discrepancy between the proposal mixture and the target is deemed to be too large. As a result, the number of components in the mixture proposal is not fixed in advance. Theoretically, we prove that there exists a stochastic process that can be made arbitrarily close to AIMM and that converges to the correct target distribution. We also illustrate that it performs well in practice in a variety of challenging situations, including high-dimensional and multimodal target distributions. Finally, the methodology is successfully applied to two real data examples, including the Bayesian inference of a semiparametric regression model for the Boston Housing dataset. Supplementary materials for this article are available online.
This dataset is a subset of the Yelp Challenge dataset; it contains all the reviews from 2015.
Context Stars mostly form in clusters and associations rather than in isolation. Milky Way star clusters are easily observable with small telescopes, and in some cases even with the naked eye. Depending on a variety of conditions, star clusters may dissolve quickly or be very long lived. The dynamical evolution of star clusters is a topic of very active research in astrophysics. Some popular models of star clusters are the so-called direct N-body simulations [1, 2], where every star is represented by a point particle that interacts gravitationally with every other particle. This kind of simulation is computationally expensive, as it scales as O(N^2) where N is the number of particles in the simulated cluster. In the following, the words "particle" and "star" are used interchangeably. Content This dataset contains the positions and velocities of simulated stars (particles) in a direct N-body simulation of a star cluster. In the cluster there are initially 64000 stars distributed in position-velocity space according to a King model [3]. Each .csv file named c_xxxx.csv corresponds to a snapshot of the simulation at time t = xxxx. For example, c_0000.csv contains the initial conditions (positions and velocities of stars at time t=0). Times are measured in standard N-body units [4]. This is a system of units where G = M = −4E = 1 (G is the gravitational constant, M the total mass of the cluster, and E its total energy). **x, y, z** Columns 1, 2, and 3 of each file are the x, y, z positions of the stars. They are also expressed in standard N-body units [4]. You can switch to units of the median radius of the cluster by finding the cluster center and calculating the median distance of stars from it, and then dividing x, y, and z by this number. In general, the median radius changes in time. The initial conditions are approximately spherically symmetric (you can check) so there is no particular physical meaning attached to the choice of x, y, and z. 
**vx, vy, vz** Columns 4, 5, and 6 contain the x, y, and z velocity, also in N-body units. A scale velocity for the stars can be obtained by taking the standard deviation of velocity along one direction (e.g. z). You may check that the ratio between the typical radius (see above) and the typical velocity is of order unity. **m** Column 7 is the mass of each star. For this simulation this is identically 1.5625e-05, i.e. 1/64000. The total mass of the cluster is initially 1. More realistic simulations (coming soon) have a spectrum of different masses and live stellar evolution, which results in changes in the mass of stars. This simulation is a pure N-body problem instead. **Star id number** The id numbers of each particle are listed in the last column (8) of the files under the header "id". The ids are unique and can be used to trace the position and velocity of a star across all files. There are initially 64000 particles. At the end of the simulation there are 63970. This is because some particles escape the cluster. Acknowledgements This simulation was run on a Center for Galaxy Evolution Research (CGER) workstation at Yonsei University (Seoul, Korea), using the NBODY6 software (https://www.ast.cam.ac.uk/~sverre/web/pages/nbody.htm). Inspiration Some stars hover around the center of the cluster, while others get kicked out to the cluster outskirts or even leave the cluster altogether. Can we predict where a star will be at any given time based on its initial position and velocity? Can we predict its velocity? How correlated are the motions of stars? Can we predict the velocity of a given star based on the velocity of its neighbours? The size of the cluster can be measured by defining a center (see below) and finding the median distance of stars from it. This is called the three-dimensional effective radius. Can we predict how it evolves over time? What are its properties as a time series? What can we say about other quantiles of the radius? 
How to define the cluster center? Just as the mode of a KDE of the distribution of stars? How does it move over time and how to quantify the properties of its fluctuations? Is the cluster symmetric around this center? Some stars leave the cluster: over time they exchange energy in close encounters with other stars and reach the escape velocity. This can be seen by comparing later snapshots with the initial one: some IDs are missing and there is overall a lower number of stars. Can we predict which stars are more likely to escape? When will a given star escape? References [1] Heggie, D., Hut, P. 2003, The Gravitational Million-Body Problem: A Multidisciplinary Approach to Star Cluster Dynamics ~ Cambridge University Press, 2003 [2] Aarseth, S.~J. 2003, Gravitational N-Body Simulations - Cambridge University Press, 2003 [3] King, I. 1966, AJ, 71, 64 [4] Heggie, D. C., Mathieu, R. D. 1986, Lecture Notes in Physics, Vol. 267, The Use of Supercomputers in Stellar Dynamics, Berlin, Springer
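The quantities described above (median radius, scale velocity, and the set of escaped stars) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the coordinate-wise median is used as a stand-in for the cluster center (the description leaves the choice of center open), and loading the c_xxxx.csv snapshots into arrays is left to the reader.

```python
import numpy as np

def median_radius(pos, center=None):
    """Median distance of stars from the cluster center.

    pos: array of shape (N, 3) holding the x, y, z columns.
    center: optional length-3 array; if omitted, the coordinate-wise
    median is used (an assumption, not the dataset author's definition).
    """
    pos = np.asarray(pos, dtype=float)
    if center is None:
        center = np.median(pos, axis=0)
    distances = np.linalg.norm(pos - np.asarray(center, dtype=float), axis=1)
    return float(np.median(distances))

def velocity_scale(vel, axis=2):
    """Standard deviation of velocity along one direction (z by default)."""
    return float(np.std(np.asarray(vel, dtype=float)[:, axis]))

def escaped_ids(ids_initial, ids_later):
    """Ids present in the initial snapshot but missing from a later one."""
    return np.setdiff1d(ids_initial, ids_later)
```

Dividing x, y, and z by the value returned by `median_radius` rescales a snapshot to units of the median radius, as suggested in the column description.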
Context Invasive Ductal Carcinoma (IDC) is the most common subtype of all breast cancers. To assign an aggressiveness grade to a whole mount sample, pathologists typically focus on the regions which contain the IDC. As a result, one of the common pre-processing steps for automatic aggressiveness grading is to delineate the exact regions of IDC inside of a whole mount slide. Content The original dataset consisted of 162 whole mount slide images of Breast Cancer (BCa) specimens scanned at 40x. From that, 277,524 patches of size 50 x 50 were extracted (198,738 IDC negative and 78,786 IDC positive). Each patch's file name has the format u_xX_yY_classC.png, for example 10253_idx5_x1351_y1101_class0.png, where u is the patient ID (10253_idx5), X is the x-coordinate of where this patch was cropped from, Y is the y-coordinate of where this patch was cropped from, and C indicates the class, where 0 is non-IDC and 1 is IDC. Acknowledgements The original files are located here: http://gleason.case.edu/webdata/jpi-dl-tutorial/IDC_regular_ps50_idx5.zip Citation: https://www.ncbi.nlm.nih.gov/pubmed/27563488 and http://spie.org/Publications/Proceedings/Paper/10.1117/12.2043872 Inspiration Breast cancer is the most common form of cancer in women, and invasive ductal carcinoma (IDC) is the most common form of breast cancer. Accurately identifying and categorizing breast cancer subtypes is an important clinical task, and automated methods can be used to save time and reduce error.
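The file-name convention above lends itself to a small parser. The following is a minimal sketch, not part of the original dataset tooling; `PatchInfo` and `parse_patch_name` are hypothetical names, and the pattern assumes the coordinates are non-negative integers and the class is a single 0 or 1.

```python
import re
from typing import NamedTuple

class PatchInfo(NamedTuple):
    patient_id: str  # the "u" part, e.g. "10253_idx5"
    x: int           # x-coordinate of the crop
    y: int           # y-coordinate of the crop
    label: int       # 0 = non-IDC, 1 = IDC

# u_xX_yY_classC.png
_PATCH_NAME = re.compile(
    r"^(?P<u>.+)_x(?P<x>\d+)_y(?P<y>\d+)_class(?P<c>[01])\.png$"
)

def parse_patch_name(name: str) -> PatchInfo:
    """Split a patch file name into patient ID, coordinates, and class."""
    match = _PATCH_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected patch file name: {name!r}")
    return PatchInfo(match["u"], int(match["x"]), int(match["y"]), int(match["c"]))
```

Grouping patches by `patient_id` before splitting into training and test sets keeps all patches of one patient on the same side of the split.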
"Datasets for comparing movements and patterns between the main Spanish stock market indices and cryptocurrencies" Context The aim here is to detect or forecast the movements caused by a range of factors, both internal (buying and selling) and external (political and economic events, etc.), in the main Spanish stock market indices and in cryptocurrencies. We selected several data sources to generate CSV files that store different values over the same time period. It is worth stressing that we are mainly interested in the upward or downward trends that can be calculated or retrieved for these periods. Content The content consists of several CSV files. For cryptocurrency movements, one file was generated per day of the period studied. For the main stock market indices, one folder was generated per day of the period, and each directory holds one file named after each index. For this reason the latter were compressed before being published in the kaggle.com open data directory. As for the fields, we were interested in detecting upward and downward movements, or at least those that follow a similar pattern in cryptocurrencies and indices. The most notable fields are: • Nom (name): name of the company or cryptocurrency; • Preu (price): value in euros of a share or a cryptocurrency; • Volum (volume): in euros / 24-hour volume, the accumulated daily transactions in millions of euros • Simbol (symbol): symbol or acronym of the currency • Cap de mercat (market cap): total value of all coins at the current moment • Oferta circulant (circulating supply): value in business opportunity • % 1h, % 2h and % 7d: percentage of the currency's value over 1h, 2h, or 7d relative to the other cryptocurrencies. 
Acknowledgements The data sources used to build the datasets are: - http://www.eleconomista.es - https://coinmarketcap.com Consequently, the stock market and cryptocurrency data are ultimately subject to the licences of the respective websites. For financial terminology, see the glossary at renta4banco. [https://www.r4.com/que-necesitas/formacion/diccionario] Inspiration An earlier study gives a first look at how the algorithms have been approached: - https://arxiv.org/pdf/1410.1231v1.pdf Cryptocurrency trading is relatively new and quite popular because of its formulation as a digital medium of exchange, using a protocol that guarantees the security, integrity, and balance of its account state through a network of agents. The community will be able to answer, among other questions: - Are there common patterns in, or effects between, cryptocurrency prices and the main stock market of Spain? - Do external effects or agents affect shares and cryptocurrencies equally? - Are there cause-and-effect relationships between shares and cryptocurrencies? Project repository https://github.com/acostasg/scraping Datasets The generated CSV files that make up the dataset have been published on kaggle.com: * https://www.kaggle.com/acostasg/stock-index/ * https://www.kaggle.com/acostasg/crypto-currencies The stock-index files are compressed by folder, labelled with the extraction date, and each file carries the name of a stock index. The cryptocurrency data, by contrast, is split into one file per extraction date, each containing all the currencies.
Comparisons of this dataset to muscle and neural expression datasets. Included in this file are: 1- total muscle enriched genes [32] that are also SGP-biased or hmc-biased, 2- larval pan-neural enriched genes [31] that are also SGP-biased or hmc-biased, 3- genes that are expressed in muscle, neuron, and hmc, 4- GO terms for genes that are expressed in muscle and hmc. 5- GO terms for genes that are expressed in neuron and hmc. 6- genes that are involved in the synaptic vesicle cycle [59], 7- genes that encode components of thin and thick filaments of body wall muscle [60], and 8- genes that encode FMRF-like and insulin-like peptides. (XLSX 159 kb)
The dataset described in https://arxiv.org/abs/1809.01574 DOI: 10.13140/RG.2.2.15252.76161/1 Approx. 1000 entities for stance detection in Russian.
Context Approximately 10 people are shot on an average day in Chicago. http://www.chicagotribune.com/news/data/ct-shooting-victims-map-charts-htmlstory.html http://www.chicagotribune.com/news/local/breaking/ct-chicago-homicides-data-tracker-htmlstory.html http://www.chicagotribune.com/news/local/breaking/ct-homicide-victims-2017-htmlstory.html Content This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. This data includes unverified reports supplied to the Police Department. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. Update Frequency: Daily Fork [this kernel][1] to get started. Acknowledgements https://bigquery.cloud.google.com/dataset/bigquery-public-data:chicago_crime https://cloud.google.com/bigquery/public-data/chicago-crime-data Dataset Source: City of Chicago This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source (https://data.cityofchicago.org) and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. Banner Photo by [Ferdinand Stohr from Unsplash][2]. 
Inspiration What categories of crime exhibited the greatest year-over-year increase between 2015 and 2016? Which month generally has the greatest number of motor vehicle thefts? How does temperature affect the incident rate of violent crime (assault or battery)? [1]: https://www.kaggle.com/paultimothymooney/starter-kernel-for-chicago-crime-dataset [2]: https://unsplash.com/photos/EK8DxK_7IwY
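As a rough illustration of the first question, the year-over-year change per crime category can be computed from (year, category) pairs. This is a sketch only: the record layout and the function names are assumptions made for the example and do not reflect the actual table schema, which is not described here.

```python
from collections import Counter

def yoy_change(records, year_a, year_b):
    """Change in incident count per category from year_a to year_b.

    records: iterable of (year, category) tuples (a hypothetical layout).
    Categories absent in one of the two years count as zero for that year.
    """
    counts = Counter(records)
    categories = {category for (_, category) in counts}
    return {
        category: counts[(year_b, category)] - counts[(year_a, category)]
        for category in categories
    }

def greatest_increase(records, year_a, year_b):
    """Category with the largest absolute increase between the two years."""
    change = yoy_change(records, year_a, year_b)
    return max(change, key=change.get)
```

Note that this ranks categories by absolute change; ranking by relative change would require dividing by the count for the earlier year and handling categories with zero incidents.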
This is the dataset for the paper titled: Strong Rotational Anisotropies Affect Nonlinear Chiral Metamaterials. Produced by David Hooper (d.c.hooper@bath.ac.uk) and Joel Collins (j.collins@bath.ac.uk) • The folders TI-539A and TI-540A contain SHG continuous polarization data for the left- and right-handed nanohelices, respectively. The data found in these folders is used to produce figures 2c-f, 3, 4 in the manuscript and figures S2-S6 in the supporting information. o The subfolders are split into measurements performed at normal incidence and 45 degrees incidence. ▪ The subfolders give information about the sample geometry such as the polarizer-analyzer configuration and which parts were rotated. • The data files are then named such as 20160218_Pol_0_Ana_0_Sample_0_QWP_0to360Step5_17h07m18s which should be read as: DateStamp_PolarizerAngle_AnalyzerAngle_SampleAngle_QWP_Range&Step_TimeStamp o There are two columns within each data file. The 1st column contains the angle (in degrees) of the Quarter-wave plate (QWP) over the range and in steps given in the file name. The 2nd column contains the SHG counts per second recorded by the photon counting system. (The zero degree angle means that the fast axis of a component is horizontal with respect to the bench. 90 degrees then means the fast axis is vertical (normal to the bench). The angle of the components (not sample) is recorded from the frame of reference looking against the direction of propagation, zero on the left-hand side horizontal and the fast axis rotated clockwise. This information should be enough to reconstruct the polarization state incident on the sample.) • The folder “Linear Polarization Anisotropy P-in P-out TI540A” contains the data for the continuous polarization measurement displayed in Figure 2a of the paper. These measurements are performed at 45 degrees incidence. o The file format is the same as explained above. • The “Linear Spectrum” folder contains the data for figures 2b and S1. 
It contains its own read me file.
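The file-naming scheme described above (DateStamp_PolarizerAngle_AnalyzerAngle_SampleAngle_QWP_Range&Step_TimeStamp) can be decoded with a regular expression. The sketch below is an illustration, not the authors' code: `parse_shg_name` and the dictionary keys are hypothetical, and it assumes angles are plain decimal numbers and that names carry no file extension, as in the example given.

```python
import re
from datetime import datetime

_NUM = r"-?\d+(?:\.\d+)?"  # assumed number format for angles and steps
_SHG_NAME = re.compile(
    rf"^(?P<date>\d{{8}})_Pol_(?P<pol>{_NUM})_Ana_(?P<ana>{_NUM})"
    rf"_Sample_(?P<sample>{_NUM})_QWP_(?P<start>{_NUM})to(?P<stop>{_NUM})"
    rf"Step(?P<step>{_NUM})_(?P<time>\d{{2}}h\d{{2}}m\d{{2}}s)$"
)

def parse_shg_name(name: str) -> dict:
    """Decode one SHG data file name into its measurement settings."""
    match = _SHG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected data file name: {name!r}")
    timestamp = datetime.strptime(
        match["date"] + match["time"], "%Y%m%d%Hh%Mm%Ss"
    )
    return {
        "timestamp": timestamp,
        "polarizer_deg": float(match["pol"]),
        "analyzer_deg": float(match["ana"]),
        "sample_deg": float(match["sample"]),
        "qwp_start_deg": float(match["start"]),
        "qwp_stop_deg": float(match["stop"]),
        "qwp_step_deg": float(match["step"]),
    }
```

The decoded QWP range and step can then be checked against the first column of the corresponding data file.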
Talk given during the "Harmonise This! Analyzing Diverse Neuroimaging Datasets" workshop at the 2015 Organization for Human Brain Mapping (OHBM) conference in Hawaii, 14-18 June.
This is a sample dataset that includes 29,437 full text articles for testing SparkText (SparkText: biomedical text mining on big data framework).
Number of papers per category for ten key entropy concepts. The concepts were selected according to their frequency of appearances in all abstracts in our dataset.
Dataset title Sales of shampoo over a three year period Last updated 1 Feb 2014, 19:52 Last updated by source 20 Jun 2012 Provider Time Series Data Library Provider source Makridakis, Wheelwright and Hyndman (1998) Source URL http://datamarket.com/data/list/?q=provider:tsdl Units Dataset metrics 36 fact values in 1 timeseries. Time granularity Month Time range Jan 1 – Dec 3 Language English License Default open license License summary This data release is licensed as follows: You may copy and redistribute the data. You may make derivative works from the data. You may use the data for commercial purposes. You may not sublicense the data when redistributing it. You may not redistribute the data under a different license. Source attribution on any use of this data: Must refer source. Description Sales, Source: Makridakis, Wheelwright and Hyndman (1998), in file: data/shampoo, Description: Sales of shampoo over a three year period
Archive of python scripts used in data analysis. See Readme in that archive and COMMANDS.txt
The Rooftop Energy Potential of Low Income Communities in America (REPLICA) data set provides estimates of residential rooftop solar technical potential at the tract level, with emphasis on estimates for Low and Moderate Income (LMI) populations. In addition to technical potential, REPLICA comprises 10 additional datasets at the tract level that provide socio-demographic and market context. The model year vintage of REPLICA is 2015. The LMI solar potential estimates are made at the tract level, grouped by Area Median Income (AMI), income, tenure, and building type. These estimates are based on LiDAR data of 128 metropolitan areas, statistical modeling, and ACS 2011-2015 demographic data. The remaining datasets are supplemental datasets that can be used in conjunction with the technical potential data for general LMI solar analysis, planning, and policy making. The core dataset is a wide-format CSV file, seeds_ii_replica.csv, that can be tagged to a tract geometry using the GEOID or GISJOIN fields. In addition, users can download geographic shapefiles for the main or supplemental datasets. This dataset was generated as part of the larger NREL-led SEEDSII (Solar Energy Evolution and Diffusion Studies) project, and specifically for the NREL technical report titled "Rooftop Solar Technical Potential for Low-to-Moderate Income Households in the United States" by Sigrin and Mooney (2018). This dataset is intended to give researchers, planners, advocates, and policy-makers access to credible data to analyze low-income solar issues and potentially perform cost-benefit analysis for program design. To explore the data in an interactive web mapping environment, use the NREL SolarForAll app.
List and characteristics of the different public transcriptomic datasets from ovarian cancer used to establish the potential impact of CDR2 and CDR2L expression on overall survival. The Stage column is early/late/unknown. The Histology column is ser/clearcell/endo/mucinous/other/unknown.
The dataset contains information on empathic accuracy, personality, and health status of chronic pain patients and informal caregivers.
Context Indian Hindi cinema, popularly known as Bollywood, has witnessed exponential growth in terms of volume of business, manpower employed, number of movies produced each year, and global reach. Hence, it could be of great commercial importance to develop a model that predicts the success of a movie before its release. However, it is not easy to forecast demand for a movie. A number of factors, such as actors, directors, time of release, genre, and production house, affect the outcome of a movie. The primary requirement for developing such a model is the availability of Bollywood movie data. Thus, I created this dataset while working on my senior year research project, titled 'Predicting success of upcoming Bollywood movies'. Content The data has been created manually by visiting different websites, the primary ones being Wikipedia, boxofficeindia.com and IMDB. The data contains 1285 rows with movies released between 2001 and 2014. The hitFlop column contains values from 1 to 9: 1 - Disaster, 2 - Flop, 3 - Below Average, 4 - Average, 5 - Semi Hit, 6 - Hit, 7 - Super Hit, 8 - Blockbuster, 9 - All-Time Blockbuster. Acknowledgements Research Guide - Dr. S.K. Saha Inspiration Can we save the time and money wasted by movie viewers on viewing flop and disaster movies? Can we suggest must-watch movies to movie viewers even before movies release? Can we classify upcoming movies into 1 of 9 categories even before their release?
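For convenience, the hitFlop codes listed above can be captured in a lookup table. This is a small sketch; `HIT_FLOP_LABELS` and `hit_flop_label` are hypothetical names introduced only for the example.

```python
# Labels for the hitFlop column, as listed in the dataset description.
HIT_FLOP_LABELS = {
    1: "Disaster",
    2: "Flop",
    3: "Below Average",
    4: "Average",
    5: "Semi Hit",
    6: "Hit",
    7: "Super Hit",
    8: "Blockbuster",
    9: "All-Time Blockbuster",
}

def hit_flop_label(code):
    """Return the verdict label for a hitFlop code, or raise on unknown codes."""
    try:
        return HIT_FLOP_LABELS[int(code)]
    except (KeyError, TypeError, ValueError):
        raise ValueError(f"unknown hitFlop code: {code!r}") from None
```

Because the codes are ordered from worst to best outcome, the column can be treated either as nine classes or as an ordinal target.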
Context These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages. This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website. Content This dataset consists of the following files: **movies_metadata.csv:** The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies. **keywords.csv:** Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object. **credits.csv:** Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object. **links.csv:** The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset. **links_small.csv:** Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset. **ratings_small.csv:** The subset of 100,000 ratings from 700 users on 9,000 movies. The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed [here](https://grouplens.org/datasets/movielens/latest/) Acknowledgements This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. 
You can try it for yourself [here](https://www.themoviedb.org/documentation/api). The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available [here](https://grouplens.org/datasets/movielens/latest/)  Inspiration This dataset was assembled as part of my second Capstone Project for Springboard's [Data Science Career Track](https://www.springboard.com/workshops/data-science-career-track). I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems. Both my notebooks are available as kernels with this dataset: [The Story of Film](https://www.kaggle.com/rounakbanik/the-story-of-film) and [Movie Recommender Systems](https://www.kaggle.com/rounakbanik/movie-recommender-systems) Some of the things you can do with this dataset: Predicting movie revenue and/or movie success based on a certain metric. What movies tend to get higher vote counts and vote averages on TMDB? Building Content Based and Collaborative Filtering Based Recommendation Engines.
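The keywords and credits files are described above as holding stringified JSON objects, so each cell has to be parsed before use. A minimal sketch follows, under the assumption that a cell may be either strict JSON or a Python-literal style string (single quotes); the helper names and the "name" key are illustrative assumptions and are not taken from the dataset documentation.

```python
import ast
import json

def parse_stringified(value):
    """Parse one stringified cell; return [] when it cannot be parsed."""
    if not isinstance(value, str) or not value.strip():
        return []
    try:
        return json.loads(value)  # strict JSON first
    except json.JSONDecodeError:
        pass
    try:
        # Fallback for Python-literal style strings; literal_eval only
        # evaluates literals, so it does not execute arbitrary code.
        return ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []

def extract_names(value, key="name"):
    """Collect one field (assumed to be "name") from a list of objects."""
    parsed = parse_stringified(value)
    if not isinstance(parsed, list):
        return []
    return [item[key] for item in parsed if isinstance(item, dict) and key in item]
```

Applied column-wise, this turns each cell into a plain list of strings that is easier to feed into a content-based recommender.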
A synthetic dataset containing the fighting rate values for each of the familial lineages that were represented in the male combinations for every "all unrelated" vial; this was accomplished using our observations made for males from the appropriate corresponding "males related" vials. The three values (corresponding to the three males in an unrelated trio) in this synthetic dataset were ranked from most aggressive to least aggressive. The fighting rate in the corresponding "all unrelated" vial is also included.
This folder contains the datasets and code used in the paper: ACTINN: Automated identification of Cell Types in Single Cell RNA Sequencing
This is a THz Security Image Dataset. Possible research on this dataset includes the development of THz quality standards, the selection of the best display mode, the enhancement of images, the modeling of image noise, and the detection of prohibited goods (ground truth is included in the "data" file). If you have any questions, you can send a request to humenghan@sjtu.edu.cn. Please cite the following paper if you wish to use our dataset: Menghan Hu, Guangtao Zhai, Rong Xie, Xiongkuo Min, Qingli Li, Xiaokang Yang, Wenjun Zhang, "A Wavelet-Predominant Algorithm can Evaluate Quality of THz Security Image and Identify its Usability," IEEE Transactions on Broadcasting, 2019, accepted.
Brain, Object, Landscape Dataset. Vision science, particularly machine vision, is being revolutionized by large-scale datasets. State-of-the-art artificial vision models critically depend on large-scale datasets to achieve high performance. In contrast, although large-scale learning models (e.g., AlexNet) have been applied to human neuroimaging data, the stimuli for such neuroimaging experiments include significantly fewer images. The small size of these stimulus sets also translates to limited image diversity. Here we dramatically increase the stimulus set size deployed in an fMRI study of visual scene processing. We scanned four participants in a slow event-related design that incorporated 4,916 unique scenes. Data was collected over 16 sessions, 15 of which were task-related sessions, plus an additional session for acquiring high resolution anatomical scans. In 8 of the 15 task-related sessions, a functional localizer was run in order to independently define scene-selective cortex. In each scanning session, participants filled out a questionnaire (Daily Intake) about their daily routine, including: current status regarding food and beverage intake, sleep, exercise, ibuprofen, and comfort in the scanner. During BOLD scanning, physiological data (heart rate and respiration) was also acquired. The experiment included 4,803 images presented on a single trial, 112 images repeated four times, and one image repeated three times, yielding a total of 5,254 stimulus trials. The stimuli were drawn from three datasets: 1) 1000 images from Scene Images (250 scene categories, based on SUN categories, with four exemplars each); 2) 2000 images from the COCO dataset; and 3) 1916 images from the ImageNet dataset. In the experiment, images were presented for 1 second, with 9 seconds of fixation between trials. 
Participants were asked to judge whether they liked, disliked, or were neutral about the image. In sum, our dataset is unique in three ways: it is 1) significantly larger than existing slow-event neural datasets by an order of magnitude, 2) extremely diverse in stimuli, 3) considerably overlapping with existing computer vision datasets. Our large-scale dataset enables novel neural network training and novel exploration of benchmark computer vision datasets through neuroscience. Finally, the scale advantage of our dataset and the use of a slow event-related design enables, for the first time, joint computer vision and fMRI analyses that span a significant and diverse region of image space using high-performing models. Please refer to our website for more details and future news and releases: BOLD5000.org. arXiv preprint in references below: https://arxiv.org/abs/1809.01281 v2: Added BOLD5000_ROIs.zip (9/7/18). v3: Added BOLD5000_MRI-Protocols.zip (9/11/18). v4: Added Austin Marcus as author and image stimuli files moved to a different location (see bold5000.org).
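The image and trial counts quoted in the description are mutually consistent, which can be checked with a few lines of arithmetic. The variable names below are illustrative only.

```python
# Counts taken from the description above.
single_presentation = 4803   # images shown on a single trial
repeated_four_times = 112    # images shown four times
repeated_three_times = 1     # image shown three times

unique_images = single_presentation + repeated_four_times + repeated_three_times
total_trials = (
    single_presentation
    + 4 * repeated_four_times
    + 3 * repeated_three_times
)

# Images per source dataset: Scene Images, COCO, ImageNet.
images_by_source = {"Scene Images": 1000, "COCO": 2000, "ImageNet": 1916}

print(unique_images)                   # 4916 unique scenes
print(total_trials)                    # 5254 stimulus trials
print(sum(images_by_source.values()))  # 4916, matching the unique count
```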
**Context:** Daily horse racing (thoroughbred) information that has been, and is still being, actively collected and aggregated from a variety of sources. The only year covered is 2016; country is irrelevant to the dataset. **Acknowledgements:** This data has been, and is still being, actively collected and aggregated from a variety of sources, all in the public domain. **Past Research:** None of merit. The data is currently used to influence some betting decisions, but no solid machine learning models have been developed. I have thrown various versions of the data into: - Google Prediction - Amazon Machine Learning - Azure Machine Learning - Watson Analytics as a way to learn how these systems work. **Inspiration:** Probably one of the hardest things to do is pick stocks and horses. I have been involved in the stocks and horses industry for many years, and through publishing previous libraries and software I have met many interesting people and also one of my long-term clients/friends. I am currently trying to enhance my software development skills by learning data science / machine learning. I have done a few tutorials and I am hoping that by publishing this data I can learn and collaborate with members of the Kaggle Community. 
**Content:** **markets.csv** - id - start_time - what time did the race start, datetime in UTC - venue_id - race_number - distance(m) - condition_id - track condition, see conditions.csv - weather_id - weather on day, see weathers.csv - total_pool_win_one - rough $ amount wagered across all runners for win market - total_pool_place_one - rough $ amount wagered across all runners for place market - total_pool_win_two - total_pool_place_two - total_pool_win_three - total_pool_place_three **runners.csv** - id - collected - what time was this row created/data collected, datetime in UTC - market_id - position - **THIS IS THE FIELD WE WANT TO PREDICT!!!!** - Will either be 1,2,3,4,5,6 etc or 0/null if the horse was scratched or failed to finish - If all positions for a market_id are null it means we were unable to match up the positional data for this market - place_paid - Will either be 1/0 or null - If you see a race that only has 2 booleans of 1 it means that the race only paid out places on the first two positions - margin - If the runner didnt win, how many lengths behind the 1st place was it - horse_id - see horses.csv - trainer_id - rider_id - see riders.csv - handicap_weight - number - barrier - blinkers - emergency - did it come into the race at the last minute - form_rating_one - form_rating_two - form_rating_three - last_five_starts - favourite_odds_win - from one of the odds sources, will it win - true/false - favourite_odds_place - from one of the odds sources, will it win - true/false - favourite_pool_win - favourite_pool_place - tip_one_win - from a tipster, will it win - true/false - tip_one_place - from a tipster, will it place - true/false - tip_two_win - tip_two_place - tip_three_win - tip_three_place - tip_four_win - tip_four_place - tip_five_win - tip_five_place - tip_six_win - tip_six_place - tip_seven_win - tip_seven_place - tip_eight_win - tip_eight_place - tip_nine_win - tip_nine_place **odds.csv (collected for every runner 10 minutes out from 
race start until race starts)** - runner_id - collected - what time was this row created/data collected, datetime in UTC - odds_one_win - from odds source, win odds - odds_one_win_wagered - from odds source, rough $ amount wagered on win - odds_one_place - from odds source, place odds - odds_one_place_wagered - from odds source, rough $ amount wagered on place - odds_two_win - odds_two_win_wagered - odds_two_place - odds_two_place_wagered - odds_three_win - odds_three_win_wagered - odds_three_place - odds_three_place_wagered - odds_four_win - odds_four_win_wagered - odds_four_place - odds_four_place_wagered **forms.csv** - collected - what time was this row created/data collected, datetime in UTC - market_id - horse_id - runner_number - last_twenty_starts - `e.g. f9x726x753x92222x35` - f = failed to finish, 7 = finished 7th, 6 = finished 6th, 7 = finished 7th, x = runner was scratched - class_level_id - 1 = eq (in same class as other horses) - 2 = up (up in class) - 3 = dn (down in class) - field_strength - days_since_last_run - runs_since_spell - overall_starts - overall_wins - overall_places - track_starts - track_wins - track_places - firm_starts - firm_wins - firm_places - good_starts - good_wins - good_places - dead_starts - dead_wins - dead_places - slow_starts - slow_wins - slow_places - soft_starts - soft_wins - soft_places - heavy_starts - heavy_wins - heavy_places - distance_starts - distance_wins - distance_places - class_same_starts - class_same_wins - class_same_places - class_stronger_starts - class_stronger_wins - class_stronger_places - first_up_starts - first_up_wins - first_up_places - second_up_starts - second_up_wins - second_up_places - track_distance_starts - track_distance_wins - track_distance_places **conditions.csv** - id - name **weathers.csv** - id - name **riders.csv (jockeys)** - id - sex **horses.csv** - id - age - sex_id - see horse_sexes.csv - sire_id - not related to horses.id, there is another table called horse_sires that is not 
present here - dam_id - not related to horses.id, there is another table called horse_dams that is not present here - prize_money - total aggregate prize money **horse_sexes.csv** - id - name
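The `last_twenty_starts` field in forms.csv packs a runner's recent results into one string, so it usually has to be decoded before it can be used as a feature. The sketch below is one possible decoding, not part of the dataset: the function names are hypothetical, and it assumes only the codes documented above (`f`, `x`, and the digits 1 to 9). Any other character is reported as `"unknown"` rather than guessed at.

```python
from collections import Counter

# Codes documented in the forms.csv description; any other character
# (for example "0") is not documented there, so it is kept as "unknown".
SPECIAL_CODES = {"f": "failed_to_finish", "x": "scratched"}


def parse_starts(form: str) -> list:
    """Turn a form string such as 'f9x726' into a list of results."""
    results = []
    for ch in form.strip().lower():
        if ch in SPECIAL_CODES:
            results.append(SPECIAL_CODES[ch])
        elif ch in "123456789":
            results.append(int(ch))  # finishing position 1..9
        else:
            results.append("unknown")
    return results


def summarize_starts(form: str) -> dict:
    """Count finished runs, wins, top-three finishes, scratchings and failures."""
    results = parse_starts(form)
    positions = [r for r in results if isinstance(r, int)]
    counts = Counter(r for r in results if isinstance(r, str))
    return {
        "runs_finished": len(positions),
        "wins": sum(1 for p in positions if p == 1),
        "top_three": sum(1 for p in positions if p <= 3),
        "scratched": counts["scratched"],
        "failed_to_finish": counts["failed_to_finish"],
    }


# The example string given in the forms.csv description.
example = summarize_starts("f9x726x753x92222x35")
print(example)
```

Summary counts like these could sit alongside the per-condition start/win/place columns as inputs to a model that predicts `position`.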
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.
Raw data for the 15/2 pollen surface sample dataset obtained from the Neotoma Paleoecological Database.
Citizen science is a participatory research practice whereby members of the public contribute to research through sensing, collecting and analysing data. Often citizen science is facilitated by the internet and digital technology including apps and web based games. There are many examples of citizen science initiatives across many disciplines, including projects that address societal or environmental challenges. The rise of citizen science and the increasing use of interactive and emerging technologies to collect, analyse and share data presents new opportunities and challenges for researchers, their institutions and the public creators of these datasets. This presentation looks to the future of publicly engaged research practices and encourages speculation around the challenges, opportunities for impact and potential innovations when data becomes playable and social.
This dataset gives us information about the things people purchase when they go to a shop. The rows are the people who buy specific things when they go to a shop. By looking at the patterns in what they buy, we can work out how to rearrange the items in the shop so that people can find what they want more easily!
This dataset contains key characteristics about the data described in the Data Descriptor Reference gene set and small RNA set construction with multiple tissues from Davidia involucrata Baill.. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format 3. machine readable metadata file in ISA-Tab format (zipped folder)
This is a THz Security Image Dataset. Possible research on this dataset may include the development of THz quality standards, the selection of the best display mode, the enhancement of images, the modeling of image noise, and the detection of prohibited goods (ground truth is included in the "data" file). If you have any questions, you can send a request to humenghan@sjtu.edu.cn. Please cite the following paper if you wish to use our dataset: Menghan Hu, Guangtao Zhai, Rong Xie, Xiongkuo Min, Qingli Li, Xiaokang Yang, Wenjun Zhang, "A Wavelet-Predominant Algorithm can Evaluate Quality of THz Security Image and Identify its Usability," IEEE Transactions on Broadcasting, 2019, accepted.
Orthologous groups predicted by OrthoMCL. Datasets included: 35 Sceloporus species, plus the Anolis carolinensis, Gallus gallus, and human proteins.
Context The city of Seattle makes available its database of pet licenses issued from 2005 to the beginning of 2017 as part of the city's ongoing [Open Data Initiative](https://data.seattle.gov/). The data is also obtainable from the [Socrata Open Data Access (SODA)](https://data.seattle.gov/Community/Seattle-Pet-Licenses/jguv-t9rb) portal in either CSV or JSON formats. It is also made available here (unofficially, I have no official affiliation with the city of Seattle or the Seattle Animal Shelter) to help spread awareness of the dataset and Seattle's Pet Licensing initiative. Content Seattle Pet Licenses Dataset The data set contains information on licenses issued as far back as 2005 to the end of January 2017. **Dataset Columns:** * License Issue Date: Floating Timestamp - Date and time of when the pet license was issued. * License Number: Integer - Unique ID for each issued license. * Animal's Name: String - Name of the licensed pet. * Species: String - Species of the licensed pet. Will be either 'Dog,' 'Cat,' or 'Livestock.' * Primary Breed: String - Primary breed of the licensed pet. * Secondary Breed: String - Secondary breed (if any) of the licensed pet. Washington Zip Codes Tax Returns by Income Bracket As part of an analysis done to see if there is a relationship between the volume of pet licenses and the affluence of the particular area, the data also includes the [Statistics of Income 2015 dataset](https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2015-zip-code-data-soi) that features the number of tax returns received by the IRS from each Seattle zip code broken out by several income brackets. The uploaded data represents a clean set of data for analysis use. 
Acknowledgements The Seattle Pet Licenses dataset was compiled by the City of Seattle Department of Finance and Administrative Services through the city of Seattle's Open Data initiative, and all credit goes to the original creators and maintainers of the data, the [Seattle Animal Shelter](http://www.seattle.gov/animalshelter). I am merely trying to make the data available to a broader audience to help spread awareness. The [Statistics of Income (SOI) dataset](https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi) is owned and maintained by the IRS. The data presented here is a clean representation of the Washington Zip Code SOI 2015 dataset. Inspiration The dataset shows there were almost no pets licensed from 2005 up until mid-2014 when volume began rising drastically, for reasons as yet unknown (in that I wasn't able to find any sources mentioning any news that would cause such a significant increase). There also appears to be a massive disparity in the number of dogs licensed compared to cats, even though there are approximately 5 million more owned cats in the United States over dogs. Thus, I hope that by making this data more available, users who analyze the data can find insights and recommendations for the Seattle Animal Shelter to increase pet licensing numbers and help show pet owners who haven't licensed their pets why it is essential. Extra An [analysis of the Seattle Pet Licenses dataset](https://aaronschlegel.me/extract-analyze-seattle-pet-licenses-dataset.html) with Python can also be found on my website. About Seattle Pet Licenses The [city of Seattle requires pets over eight weeks old be licensed](https://library.municode.com/wa/seattle/codes/municipal_code?nodeId=TIT9AN_CH9.25ANCO_9.25.050ANLIPEGE). 
There are several benefits to [licensing one's pet](https://www.seattle.gov/animal-shelter/license), including a return ride home if your pet is lost, and easier contact from a veterinarian if your pet is unfortunately injured. If the licensing is performed at the Seattle Animal Shelter on the third Saturday of any given month, a free rabies vaccine is included, as well as other vaccines and a microchip for a small additional fee.
The 100 trees used as phylogenetic hypotheses in this study were subsets of those developed by Jetz et al. (2012) and available through birdtree.org. Several species are represented by more than one record in this dataset, thus phylogeny tips for those taxa were transformed into multichotomies (i.e., polytomies) in each phylogeny. Here, 101 phylogenies are included. Models using the 74th phylogeny failed to converge so an additional phylogeny (i.e., phylogeny 101) was added to complete a set of 100 for the analyses. For all phylogenies the tip labels are coded to match the "record_id" field code in the corresponding dataset.
The objective of the study is to provide global grids (0.5°) of revised annual coefficients for the Priestley-Taylor (P-T) and Hargreaves-Samani (H-S) evapotranspiration methods after calibration based on the ASCE (American Society of Civil Engineers)-standardized Penman-Monteith method (the ASCE method includes two reference crops: short-clipped grass and tall alfalfa). The analysis also includes the development of a global grid of revised annual coefficients for solar radiation (Rs) estimations using the respective Rs formula of H-S. The analysis was based on global gridded climatic data of the period 1950-2000. The method for deriving annual coefficients of the P-T and H-S methods was based on partial weighted averages (PWAs) of their mean monthly values. This method estimates the annual values considering the amplitude of the parameter under investigation (ETo and Rs) giving more weight to the monthly coefficients of the months with higher ETo values (or Rs values for the case of the H-S radiation formula). The method also eliminates the effect of unreasonably high or low monthly coefficients that may occur during periods where ETo and Rs fall below a specific threshold. The new coefficients were validated based on data from 140 stations located in various climatic zones of the USA and Australia with expanded observations up to 2016. The validation procedure for ETo estimations of the short reference crop showed that the P-T and H-S methods with the new revised coefficients outperformed the standard methods reducing the estimated root mean square error (RMSE) in ETo values by 40 and 25 %, respectively. The estimations of Rs using the H-S formula with revised coefficients reduced the RMSE by 28 % in comparison to the standard H-S formula. 
Finally, a raster database was built consisting of (a) global maps for the mean monthly ETo values estimated by ASCE-standardized method for both reference crops, (b) global maps for the revised annual coefficients of the P-T and H-S evapotranspiration methods for both reference crops and a global map for the revised annual coefficient of the H-S radiation formula and (c) global maps that indicate the optimum locations for using the standard P-T and H-S methods and their possible annual errors based on reference values. The database can support estimations of ETo and solar radiation for locations where climatic data are limited and it can support studies which require such estimations on larger scales (e.g. country, continent, world). The datasets produced in this study are archived in the PANGAEA database (this data set) and in the ESRN database (http://www.esrn-database.org or http://esrn-database.weebly.com).
Context Disk space captured over several months for a set of Windows servers. Content The contents are the server name, disk drive, total disk space, free disk space and percentage of free space. Acknowledgements Thanks to Kaggle for providing this development environment. Inspiration My initial goal is to add a column with the 7-day moving average of free disk space, to be used for forecasting.
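A 7-day moving average of free disk space can be sketched in plain Python as follows. The readings are made-up values for a single drive, and the file's actual column names are not given here, so the example works on a bare list. It uses a trailing window, which is one reasonable choice for forecasting because it only looks at past days.

```python
from collections import deque


def moving_average(values, window=7):
    """Trailing moving average; None until `window` readings are available."""
    recent = deque(maxlen=window)  # automatically drops the oldest reading
    averages = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            averages.append(sum(recent) / window)
        else:
            averages.append(None)
    return averages


# Hypothetical daily free-space readings (GB) for one server drive.
free_gb = [120, 118, 117, 115, 114, 112, 110, 109, 107]
avg_7d = moving_average(free_gb, window=7)
for day, (free, avg) in enumerate(zip(free_gb, avg_7d), start=1):
    print(day, free, avg)
```

With real data the same function would be applied separately to each server and drive, after sorting the readings by date.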
A set of preference judgements among randomly generated property pairs for 350 random Wikidata persons. For each (entity, property1, property2) record, 10 annotators judged which of the two properties is more interesting for the respective entity. The goal is then to predict the annotator judgments as well as possible. Current state-of-the-art methods (Wikidata Property Suggester and others) achieve 61% precision in this task, while methods based on linguistic similarity get to 74%, still significantly below annotator agreement (87.5%). Further details are in the paper "Doctoral Advisor or Medical Condition: Towards Entity-specific Rankings of Knowledge Base Properties", ADMA 2017, available at http://www.simonrazniewski.com/2017_ADMA.pdf
Dataset connected with article: 'Increasing the use of conceptually-derived strategies in arithmetic: using inversion problems to promote the use of associativity shortcuts.' Abstract: Conceptual knowledge of key principles underlying arithmetic is an important precursor to understanding algebra and later success in mathematics. One such principle is associativity, which allows individuals to solve problems in different ways by decomposing and recombining subexpressions (e.g. ‘a + b – c’ = ‘b – c + a’). More than any other principle, children and adults alike have difficulty understanding it, and educators have called for this to change. We report three intervention studies that were conducted in university classrooms to investigate whether adults’ use of associativity could be improved. In all three studies, it was found that those who first solved inversion problems (e.g. ‘a + b – b’) were more likely than controls to then use associativity on ‘a + b – c’ problems. We suggest that ‘a + b – b’ inversion problems may either direct spatial attention to the location of ‘b – c’ on associativity problems, or implicitly communicate the validity and efficiency of a right-to-left strategy. These findings may be helpful for those designing brief activities that aim to aid the understanding of arithmetic principles and algebra.
If a change of use for industrial land is proposed in the UK, there is usually a requirement to demonstrate that the change of use will not result in the land becoming Contaminated Land, as defined under the Environmental Protection Act 1990. Under certain circumstances, this demonstration can be made by showing that the mean concentration of contaminants of potential concern is below a suitable assessment level appropriate to the proposed new use. How much sampling effort is required for this purpose? Using a relatively large dataset for arsenic in soil, an approach is presented for determining the number of measurements required for a clearance investigation to demonstrate absence of contamination. The approach minimizes the expected financial loss, taking into account both the actual cost of investigation and the possible cost of incorrectly determining that contamination is still present and undertaking unnecessary remediation. Abstract probabilities are discussed in terms of money spent and money potentially saved.
Flowering time data from a 2003 greenhouse study performed at the Kellogg Biological Station near Kalamazoo, MI. Populations: 9. Individuals per population: 22-46. Total individuals in dataset: 306. These are also the parental plants from Sahli et al. (2008).
dataset 4, Pterostichus melanarius (biological variable) x Corn (landscape variable)
Dataset for paired data using the Neurophysiology of Pain Questionnaire and the HC-PAIRS
**************** NTU Dataset ReadMe file ******************* We had to remove our data temporarily for privacy reasons.
This dataset summarizes the information for 11 journals, including the year for which the p-values of the published articles were extracted, the impact factor for the respective year and the acronym used in the R code for this study.
250 Benchmark queries for the DBpedia dataset
Content More details about each file are in the individual file descriptions. Context This is a dataset hosted by the City of New York. The city has an open data platform found [here](https://opendata.cityofnewyork.us/) and they update their information according to the amount of data that is brought in. Explore New York City using Kaggle and all of the data sources available through the City of New York [organization page](https://www.kaggle.com/new-york-city)! * Update Frequency: This dataset is updated annually. Acknowledgements This dataset is maintained using Socrata's API and Kaggle's API. [Socrata](https://socrata.com/) has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public. This dataset is distributed under the following licenses: Public Domain
Starting from the Verkehrsunfaelle_train dataset, your algorithm should be able to predict the accident severity (minor, serious, fatal) of a traffic accident. You also receive a second dataset (Verkehrsunfaelle_test.csv), which is used to validate the predictive performance of your algorithm. To do this, you apply your algorithm and submit the predictions in .csv format. The file must have exactly 2 columns and 1000 rows plus a header row containing Unfall_ID and Unfallschwere. The first column contains the accident ID in ascending order; the second contains the predicted accident severity (1 = minor, 2 = serious, 3 = fatal).
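The submission format described above (header `Unfall_ID`, `Unfallschwere`; 1000 data rows; ascending IDs; severity 1, 2 or 3) can be checked before uploading. The following is a minimal sketch, assuming a comma-separated file and integer accident IDs, neither of which is stated in the task text; `check_submission` is a hypothetical helper name.

```python
import csv
import io

EXPECTED_HEADER = ["Unfall_ID", "Unfallschwere"]
VALID_SEVERITIES = {"1", "2", "3"}  # 1 = minor, 2 = serious, 3 = fatal


def check_submission(text, expected_rows=1000):
    """Return a list of problems found in the submission CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header must be Unfall_ID,Unfallschwere"]
    problems = []
    body = rows[1:]
    if len(body) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(body)}")
    previous_id = None
    for line_no, row in enumerate(body, start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 columns")
            continue
        accident_id, severity = row
        if severity not in VALID_SEVERITIES:
            problems.append(f"line {line_no}: invalid severity {severity!r}")
        if not accident_id.isdigit():  # assumption: IDs are plain integers
            problems.append(f"line {line_no}: non-numeric ID {accident_id!r}")
            continue
        if previous_id is not None and int(accident_id) <= previous_id:
            problems.append(f"line {line_no}: IDs must be ascending")
        previous_id = int(accident_id)
    return problems


# A synthetic, well-formed submission: 1000 rows, all predicted "minor".
valid = "Unfall_ID,Unfallschwere\n" + "".join(f"{i},1\n" for i in range(1, 1001))
print(check_submission(valid))
```

An empty list means no problems were found; each string in a non-empty list names the offending line.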
Sex classification dataset from [Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifierSex_classification), for the purpose of Naive Bayes classifier demonstration.
Individual diagnosis history in the dataset cohort. This dataset supports the manuscript "WAHDA: a Data Source to Promote the Impact of Checkup in Better Quality of Care", submitted to Scientific Data
Values of Jaccard dissimilarity based on incidence datasets measured within and between communities. Identical = 0; completely dissimilar = 1. Abbreviations: MT, morphotypes; EE, evolutionary independent entities obtained with GMYC; OTU, operational taxonomic units obtained with single individuals; eOTU, operational taxonomic units obtained with environmental samples; SV, sequence variants. L, littoral; SL, sublittoral; O, offshore. Dissimilarity values were estimated using both the focal phyla (SV; eOTU) and whole meiofauna dataset (SV2; eOTU2).
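For readers unfamiliar with the measure, Jaccard dissimilarity on incidence (presence/absence) data is one minus the ratio of shared taxa to all taxa found in either sample. A minimal sketch follows; the taxon labels and the three sample sets are invented for illustration and are not taken from the dataset.

```python
def jaccard_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of taxon labels."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty samples are identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical incidence data for three habitats.
littoral = {"sp1", "sp2", "sp3"}
sublittoral = {"sp2", "sp3", "sp4"}
offshore = {"sp5"}

print(jaccard_dissimilarity(littoral, littoral))     # identical samples
print(jaccard_dissimilarity(littoral, sublittoral))  # partial overlap
print(jaccard_dissimilarity(littoral, offshore))     # no shared taxa
```

The three calls cover the two endpoints named in the description (identical = 0, completely dissimilar = 1) and one intermediate case.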
The dataset consists of respiration and methane production rates and methane oxidation potential obtained from soil microcosm studies carried out under controlled temperature and incubation conditions. Soil cores collected in 2012 represent the flat- and high-centered polygon active layers and permafrost (when present) from the NGEE Arctic Intensive Study Site 1, Barrow, Alaska.
Subgrid variability introduces non-negligible scale effects on the GIS-based representation of snow. This heterogeneity is even more evident in semiarid regions, where the high variability of the climate produces various accumulation-melting cycles throughout the year and a large spatial heterogeneity of the snow cover. This variability in a watershed can often be represented by snow depletion curves (DCs). In this study, terrestrial photography (TP) of a cell-sized area (30 x 30 m) was used to define local snow DCs at a Mediterranean site. Snow cover fraction (SCF) and snow depth (h) values obtained with this technique constituted the two datasets used to define DCs. A flexible sigmoid function was selected to parameterize snow behaviour on this subgrid scale. It was then fitted to meet five different snow patterns in the control area: one for the accumulation phase and four for the melting phase in a cycle within the snow season. Each pattern was successfully associated with the snow conditions and previous evolution. The resulting DCs were able to capture certain physical features of the snow, which were used in a decision-tree and included in the point snow model formulated by Herrero et al. (2009). The final performance of this model was tested against field observations recorded over four hydrological years (2009-2013). The calibration and validation of this DC-snow model was found to have a high level of accuracy with global RMSE values of 84.2 mm for the average snow depth and 0.18 m**2/m**2 for the snow cover fraction in the control area. The use of DCs on the cell scale proposed in this research provided a sound basis for the extension of point snow models to larger areas by means of a gridded distributed calculation.
Resolution: 512*512*244; unsigned short; 0.25 mm resolution. This dataset was obtained using the dental CBCT imaging system ZCB100 (Shenzhen ZhongKe TianYue Technology Co., Ltd.) at 110 kV and 10 mAs, with a 15.36-cm FOV.
Dataset for the practice in the data preprocessing and unsupervised learning in the introduction to bioinformatics course
Using SVMs, KNN, and Random Forests on the MNIST dataset. I want to see which algorithm performs better.
Context The data is based on images I have taken with my Lytro Illum camera (https://pictures.lytro.com/ksmader); they have been exported as image data and depth maps. The idea is to build tools for looking at Lytro image data and improving the results. Content The data are from the Lytro Illum and captured as 40MP images which are then converted to 5MP RGB+D images. All of the required data for several test images is provided. The second set of data comes from the Lenovo Phab2 (Project Tango), which utilizes dual image sensors to recreate point clouds of large 3D structures. These are provided as .ply and .obj datasets. Acknowledgements The data is based on images I have taken with my Lytro Illum camera (https://pictures.lytro.com/ksmader). Inspiration 1. Build a neural network which automatically generates depth information from 2D RGB images 2. Build a tool to find gaps or holes in the depth images and fix them automatically 3. Build a neural network which can reconstruct 3D pixel data from RGBD images
This task is designed to test the differences between novices' and experts' relevance assessments. We employ the formulated queries obtained from the <em>query formulation</em> task to build a single system ranking of candidate relevant documents. Crowd workers were then provided with a medical case (among 113 topics) and this list of the top 10 candidate relevant documents. For each query and list of top 10 results, we obtained relevance judgements on a 3-point scale from 2 experts and 2 novices. More details of this task can be found in our paper in the references. Fields of the csv file: <strong>dataset</strong>: <em>CLEF_eHealth</em> or <em>OHSUMED</em> <strong>topic_id</strong>: ID of the topic (from 1 to 50 for CLEF_eHealth, from 50 to 113 for OHSUMED) <strong>answerer_type</strong>: <em>expert</em> or <em>novice</em> <strong>answerer_id</strong>: ID of crowd worker <strong>doc_id</strong>: ID of the candidate document in the dataset <strong>relevance_score</strong>: the relevance rating given by the crowd worker <strong>information_need</strong>: the clues about the desired content of relevant documents <strong>task_context</strong>: the medical case that triggers the information need
Degree centrality (DC) and local functional connectivity density (lFCD) are statistics calculated from brain connectivity graphs that measure how important a brain region is to the graph. DC (a.k.a. global functional connectivity density) is calculated as the number of connections a region has with the rest of the brain (binary DC), or the sum of weights for those connections (weighted DC). lFCD was developed to be a surrogate measure of DC that is faster to calculate by restricting its computation to regions that are spatially adjacent. Although both of these measures are popular for investigating inter-individual variation in brain connectivity, efficient neuroimaging tools for computing them are scarce. The goal of this Brainhack project was to contribute optimized implementations of these algorithms to the widely used, open source, AFNI software package. Tools for calculating DC (3dDegreeCentrality) and lFCD (3dLFCD) were implemented by modifying the C source code of AFNI’s 3dAutoTcorrelate tool. 3dAutoTcorrelate calculates the voxel-by-voxel correlation matrix for a dataset and includes most of the functionality we require, including support for OpenMP multithreading to improve calculation time, the ability to restrict the calculation using a user-supplied or auto-calculated mask, and support for both Pearson’s and Spearman correlation. Outputs from the newly developed tools were benchmarked against Python implementations of these measures from the Configurable Pipeline for the Analysis of Connectomes (C-PAC) using the publicly shared Intrinsic Brain Activity Test-Retest (IBATRT) dataset from the Consortium for Reliability and Reproducibility.
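The definitions of binary and weighted DC given above can be illustrated on a toy example. This is a plain-Python sketch of the definitions only, not the AFNI or C-PAC implementation: the four short "time series", the correlation threshold of 0.5, and the function names are all assumptions made for illustration, and the code does not handle constant (zero-variance) series.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def degree_centrality(series, threshold):
    """Return (binary_dc, weighted_dc) lists, one value per region."""
    n = len(series)
    binary = [0] * n
    weighted = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(series[i], series[j])
            if r > threshold:  # keep only supra-threshold connections
                binary[i] += 1      # binary DC: count the connection
                binary[j] += 1
                weighted[i] += r    # weighted DC: sum the connection weights
                weighted[j] += r
    return binary, weighted


# Four hypothetical regional time series.
regions = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [5, 4, 3, 2, 1],
    [1, 3, 2, 5, 4],
]
binary_dc, weighted_dc = degree_centrality(regions, threshold=0.5)
print(binary_dc)
print(weighted_dc)
```

The double loop makes the quadratic cost in the number of regions explicit, which is why whole-brain voxelwise versions benefit from the multithreading and masking options mentioned above.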
Copyright information: Taken from "Megx.net—database resources for marine ecological genomics", Nucleic Acids Research 2005;34(Database issue):D390-D393. Published online 28 Dec 2005. PMCID: PMC1347433. © The Author 2006. Published by Oxford University Press. All rights reserved. Marine genomes and metagenomic fragments can be browsed and searched on a world map on our web-based system. An example showing a Geographic-BLAST search for genes encoding proteorhodopsins in the currently available dataset.
Translated homologous gene alignments from transcriptome data. There are two datasets with the partition files. The 70% complete supermatrix and the 80% complete supermatrix. See text for more details.
Raw data for the 15/1 pollen surface sample dataset obtained from the Neotoma Paleoecological Database.
Incidence dataset for Costa Rica. This file must be located in the same folder as the R code.
Context This dataset was collected from the students of a prominent university in North India. It should be used to create the overall Institutional Report on the basis of student feedback data. Content This dataset comprises 6 categories: teaching, course content, examination, lab work, library facilities and extracurricular activities. Data for each category includes two columns, where each column can have any of three labels, i.e. 0 (neutral), 1 (positive) and -1 (negative). Acknowledgements I am thankful to the students of the institution for sharing their opinions, which helped me to create this dataset. Inspiration You should try to create the overall institutional report in all disciplines (categories) by analyzing the text-based responses using sentiment analysis methods.
Context Chat-80 was a natural language system which allowed the user to interrogate a Prolog knowledge base in the domain of world geography. It was developed in the early '80s by Warren and Pereira; see http://acl.ldc.upenn.edu/J/J82/J82-3002.pdf for a description and http://www.cis.upenn.edu/~pereira/oldies.html for the source files. The canonical metadata on NLTK:
Context The SMS Spam Collection. Content Based on the text of an SMS message, we should predict whether it is spam or not spam. Acknowledgements Thanks to the Machine Learning Repository.
The E2E dataset is a new dataset for training end-to-end, data-driven natural language generation systems in the restaurant domain. It is ten times bigger than existing, frequently used datasets in this area (>5k distinct meaning representations with >50k corresponding natural language reference texts). The E2E dataset poses new challenges: (1) its human reference texts show more lexical richness and syntactic variation, including discourse phenomena; (2) generating from this set requires content selection.
What influences love at first sight? (Or, at least, love in the first four minutes?) This [dataset][1] was compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper [Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment][2]. Data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four minute "first date" with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information. See the Speed Dating Data Key document below for details. For more analysis from Iyengar and Fisman, read [Racial Preferences in Dating][3]. Data Exploration Ideas ---------------------- - What are the least desirable attributes in a male partner? Does this differ for female partners? - How important do people think attractiveness is in potential mate selection vs. its real impact? - Are shared interests more important than a shared racial background? - Can people accurately predict their own perceived value in the dating market? - In terms of getting a second date, is it better to be someone's first speed date of the night or their last? [1]: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/ [2]: http://faculty.chicagobooth.edu/emir.kamenica/documents/genderDifferences.pdf [3]: http://faculty.chicagobooth.edu/emir.kamenica/documents/racialpreferences.pdf
The attached file contains the minimal dataset for the above-mentioned study, including the data from the (1) preliminary studies (hemolysis and flow measurements), (2) combined laser Doppler flowmetry and remission spectroscopy (O2C), (3) rate of necrosis and (4) blood gas analysis. All files are Microsoft Excel sheets. A legend explaining all abbreviations used in the data sheets is attached to each file as a tab.
The customer segments data comprises 440 data points collected from clients of a wholesale distributor in Lisbon, Portugal. More information can be found on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers). Note that (m.u.) is shorthand for *monetary units*. **Features** 1) `Fresh`: annual spending (m.u.) on fresh products (Continuous); 2) `Milk`: annual spending (m.u.) on milk products (Continuous); 3) `Grocery`: annual spending (m.u.) on grocery products (Continuous); 4) `Frozen`: annual spending (m.u.) on frozen products (Continuous); 5) `Detergents_Paper`: annual spending (m.u.) on detergents and paper products (Continuous); 6) `Delicatessen`: annual spending (m.u.) on delicatessen products (Continuous); 7) `Channel`: {Hotel/Restaurant/Cafe - 1, Retail - 2} (Nominal) 8) `Region`: {Lisbon - 1, Oporto - 2, or Other - 3} (Nominal)
The dataset includes the model output data shown in Figs. 2-9, and S1-S3, together with the GMT scripts used to generate the plots.
Platforms (or applications) used to process RDF datasets.
NOTE: This dataset has been superseded by GPCP Version 2.3, which is available in RDA dataset ds728.4 [https://rda.ucar.edu/datasets/ds728.4/]. Users are advised to transition to this updated dataset. This dataset contains Version 2.2 of the Global Precipitation Climatology Project (GPCP) combined satellite-gauge precipitation estimate and combined satellite-gauge error estimate. The data are monthly analyses defined on a global 2.5 degree by 2.5 degree longitude/latitude grid and cover the period January 1979 to (delayed) present. A monthly climatology (1979-2011) is also available. Please note that the original binary data were written using the big endian representation of unformatted binary words. Users reading this data on little endian platforms, therefore, will need to byte swap the data. The GPCP was established by the World Climate Research Program (WCRP) and subsequently attached to the Global Energy and Water Exchange program (GEWEX) to address the problem of quantifying the distribution of precipitation around the globe over many years. The general approach is to combine the precipitation information available from each of several sources into a final merged product, taking advantage of the strengths of each data type. The passive microwave estimates are based on Special Sensor Microwave/Imager (SSM/I) and Special Sensor Microwave Imager/Sounder (SSMIS) data from the series of Defense Meteorological Satellite Program (DMSP, United States) satellites that fly in sun-synchronous low-earth orbits at 6am / 6pm. The infrared precipitation estimates are computed primarily from geostationary satellites (United States, Europe, Japan), and secondarily from NOAA series polar-orbiting satellites (United States). 
Additional low-Earth orbit estimates include Atmospheric Infrared Sounder (AIRS) data from the NASA Aqua, and Television Infrared Observation Satellite Program (TIROS) Operational Vertical Sounder (TOVS) and Outgoing Longwave Radiation Precipitation Index (OPI) data from the NOAA series satellites. The precipitation gauge data are assembled and analyzed by the Global Precipitation Climatology Centre (GPCC) of the Deutscher Wetterdienst. The Version 2.2 Data Set contains data from the following contributing centers: * GPCP Polar Satellite Precipitation Data Centre - Emission (SSM/I and SSMIS emission estimates) * GPCP Polar Satellite Precipitation Data Centre - Scattering (SSM/I and SSMIS scattering estimates) * GPCP Geostationary Satellite Precipitation Data Centre (GPI and OPI estimates) * NASA/GSFC Sounder Research Team (TOVS and AIRS estimates) * GPCP Global Precipitation Climatology Centre (precipitation gauge analyses) Request to users from the data authors: The GPCP datasets are developed and maintained with international cooperation and are used by the worldwide scientific community. To better understand the evolving requirements across the GPCP user community and to increase the utility of the GPCP product suite, the dataset authors request that a citation be provided for each publication that uses the GPCP products. Please email the citation to george.j.huffman@nasa.gov or david.t.bolvin@nasa.gov. Your help and cooperation will provide valuable information for making future enhancements to the GPCP product suite.
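The byte-order note in this description can be illustrated with a short sketch. It assumes the binary words are 4-byte IEEE floats and decodes a synthetic buffer; the function name is hypothetical, and the real file layout (any header, grid ordering, record structure) is not described here, so consult the dataset documentation before reading actual files.

```python
import struct


def read_big_endian_floats(raw: bytes) -> list[float]:
    """Decode a buffer of 4-byte big-endian floats into Python floats.

    Assumption: values are 32-bit IEEE floats with no header in `raw`.
    """
    count = len(raw) // 4
    # '>' forces big-endian decoding regardless of the host byte order,
    # so no explicit byte swap is needed on little-endian machines.
    return list(struct.unpack(f">{count}f", raw[: count * 4]))


# Synthetic stand-in for a few data words (not real GPCP values).
sample = struct.pack(">3f", 0.0, 1.5, 2.25)
values = read_big_endian_floats(sample)
print(values)
```

Declaring the byte order in the format string is usually less error-prone than reading native-order values and swapping them afterwards.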
Context This dataset contains CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) images. CAPTCHAs were built in 1997 as a way to identify and block bots (in order to prevent spam, DDoS attacks, etc.). They have since been replaced by reCAPTCHA because they are breakable using artificial intelligence (as I encourage you to do). Content The images are 5-letter words that can contain numbers. The images have had noise applied to them (blur and a line). They are 200 x 50 PNGs. Acknowledgements The dataset comes from [Wilhelmy, Rodrigo & Rosas, Horacio. (2013). captcha dataset.][1] [1]: https://www.researchgate.net/publication/248380891_captcha_dataset Thumbnail image from [Accessibility of CAPTCHAs][2] [2]: http://www.bespecular.com/blog/accessibility-of-captchas/ Inspiration This dataset is a perfect opportunity to try building Optical Character Recognition algorithms.
Context Coming Soon Content Coming Soon Acknowledgements Special thanks to; http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ Inspiration Coming soon
Complete dataset used in this article
Genotypes used to construct linkage maps used in paper. First sheet contains the data from the smolt dataset. Second sheet contains the data from the meristics dataset.
Context The funniness of a joke is very subjective. With more than 70,000 users rating jokes, can an algorithm be written to identify the universally funny joke? Content - The data files are in **.csv** format. - The complete dataset is 100 rows and 73422 columns. - The complete dataset is split into 3 **.csv** files. - **JokeText.csv** contains the Id of the joke and the complete joke string. - **UserRatings1.csv** contains the ratings provided by the first 36710 users. - **UserRatings2.csv** contains the ratings provided by the last 36711 users. - The dataset is arranged such that the initial users have rated a higher number of jokes than the later users. - The rating is a real value between **-10.0** and **+10.0**. - The **empty values** indicate that the user has not provided any rating for that particular joke. Acknowledgements The dataset is associated with the research paper below. [Eigentaste: A Constant Time Collaborative Filtering Algorithm.](http://www.ieor.berkeley.edu/~goldberg/pubs/eigentaste.pdf) Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001. More information and datasets can be found at [http://eigentaste.berkeley.edu/dataset/](http://eigentaste.berkeley.edu/dataset/) Inspiration Since funniness is a very subjective matter, it will be very interesting to see if data science can bring out the details on what makes something funny.
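Because empty cells mean "not rated", they should be skipped rather than treated as zero when averaging. The sketch below computes a mean rating per joke on a tiny synthetic table; the column names and the assumption that the first column holds the joke Id are illustrative and should be checked against the actual files.

```python
import csv
import io


def mean_rating_per_joke(csv_text: str) -> dict[str, float]:
    """Average each joke's ratings, ignoring empty (unrated) cells.

    Assumption: one row per joke, first column is the joke Id,
    remaining columns are one user each.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    means: dict[str, float] = {}
    for row in reader:
        joke_id, cells = row[0], row[1:]
        ratings = [float(cell) for cell in cells if cell.strip() != ""]
        if ratings:  # jokes with no ratings at all are left out
            means[joke_id] = sum(ratings) / len(ratings)
    return means


# Synthetic sample; header names are hypothetical.
sample = "JokeId,User1,User2,User3\n0,5.0,,-1.0\n1,,,\n2,10.0,-10.0,3.0\n"
means = mean_rating_per_joke(sample)
print(means)
```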
Context This dataset provides user vote data on which video from a pair of videos was funnier. YouTube Comedy Slam was a discovery experiment that ran on YouTube in 2011 and 2012. In the experiment, pairs of videos were shown to users and the users voted for the video that they found funnier. Content The dataset includes roughly 1.7 million votes recorded chronologically. The first 80% are provided here as the training dataset and the remaining 20% as the testing dataset. Each row in this text file represents one anonymous user vote and there are three comma-separated fields. - The first two fields are YouTube video IDs. - The third field is either 'left' or 'right'. - Left indicates the first video from the pair was voted to be funnier than the second. Right indicates the opposite preference. Acknowledgements Sanketh Shetty, 'Quantifying comedy on YouTube: why the number of o's in your LOL matter,' Google Research Blog, [https://research.googleblog.com/2012/02/quantifying-comedy-on-youtube-why.html][1]. Dataset was downloaded from UCI ML repository: [https://archive.ics.uci.edu/ml/datasets/YouTube+Comedy+Slam+Preference+Data][2] [1]: https://research.googleblog.com/2012/02/quantifying-comedy-on-youtube-why.html [2]: https://archive.ics.uci.edu/ml/datasets/YouTube+Comedy+Slam+Preference+Data Inspiration Predict which videos are going to be funny!
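The three-field row format described above can be tallied with a few lines of code. This is a minimal sketch on synthetic rows; the video IDs are made up and the function name is hypothetical.

```python
import csv
import io
from collections import Counter


def count_wins(votes_text: str) -> Counter:
    """Count how often each video was voted the funnier one of its pair."""
    wins: Counter = Counter()
    for row in csv.reader(io.StringIO(votes_text)):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        first_id, second_id, vote = row
        vote = vote.strip().lower()
        if vote == "left":  # first video of the pair won
            wins[first_id] += 1
        elif vote == "right":  # second video of the pair won
            wins[second_id] += 1
    return wins


# Synthetic votes; the IDs are placeholders, not real YouTube IDs.
sample = "vidA,vidB,left\nvidB,vidC,right\nvidA,vidC,left\n"
wins = count_wins(sample)
print(wins.most_common())
```

Raw win counts favor videos that appear in many pairs, so a win rate or a pairwise model may be a better target for actual analysis.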
Context 2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C. ...) and to a fetal state (N, S, P). Therefore the dataset can be used either for 10-class or 3-class experiments. Acknowledgements Source: Marques de Sá, J.P., jpmdesa '@' gmail.com, Biomedical Engineering Institute, Porto, Portugal. Bernardes, J., joaobern '@' med.up.pt, Faculty of Medicine, University of Porto, Portugal. Ayres de Campos, D., sisporto '@' med.up.pt, Faculty of Medicine, University of Porto, Portugal. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
ECG-ViEW II sample dataset person table
In this study, we asked crowd workers to generate IoT scenarios by showing them a list of trigger (input) and action (output) devices. Each crowd worker created 3 scenarios. Currently, the following attributes are included in the dataset for each scenario: <i>Category, Trigger Devices and their Triggers, Action Devices and their Actions, word count, words per sentence, Long words, unique words, difficult words, Mean Originality, Mean Practicality, Mean Creativity, Sum Creativity, Creative (dichotomous: 0 or 1), and some worker-related information such as: Has Smart home experience (Boolean)? Total Experience (Months), Gender, Age, Family Size, Programming Experience, choices of input and output devices.</i>
The example dataset for use in the example analyses
babies.txt bwt - birth weight in ounces (999 unknown) gestation - gestation days parity - 0 means first born age - mom age in years height - mom height in inches weight - mom pre-pregnancy weight in pounds smoke - mom smoke, 0 means no, 1 means yes, 9 means unknown babies23.txt id - id number pluralty - 5 means single fetus outcome - 1 for live birth that survived at least 28 days date - birth date 1096=January 1, 1961 (this might be a timestamp, not very sure) gestation - gestation days sex - infant sex, 1=male, 2=female, 9=unknown wt - birth weight in ounces parity - 0 means first born race - mom race, 0-5=white, 6=mex, 7=black, 8=asian, 9=mix, 99=unknown age - mom age in years ed - mom education, 0=(<8), 1=(8-<12), 2=12, 3=12+trade, 4=12+some college, 5=16, 7=trade (hs unclear), 9=unknown ht - mom height in inches wt - mom pre-pregnancy weight in pounds (notice that this column name will be renamed to wt.1, since there are two duplicate wt column names) drace - dad race dage - dad age ded - dad education dht - dad height dwt - dad weight marital - 1=married, 2-4=sep, div, wid, 5=never married, blank inc - total income in 2500 increments, 0=under 2500, 1=2500-4999, ..., 9=22500+, 98=unknown, 99=not asked smoke - mom smoke, 0=never, 1=yes now, 2=until pregnancy, 3=once did not now, 9=unknown time - how long ago quit, 0=never, 1=still, 2=during preg, 3=up to 1 yr, 4=up to 2 yr, 5=up to 3 yr, 6=up to 4 yr, 7=5 to 9 yr, 8=10+ yr, 9=quit and don't know, 98=unknown number - number of cigs smoke a day for past and current smokers, 0=never, 1=1-4, 2=5-9, 3=10-14, 4=15-19, 5=20-29, 6=30-39, 7=40-60, 8=60+, 9=smoke but don't know, 98=unknown
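The key above uses numeric sentinel codes for unknown values, which would distort averages if left in place. A minimal sketch of replacing them with `None` follows; it covers only the two codes stated for babies.txt (`bwt` 999 and `smoke` 9), and the helper name and sample record are hypothetical.

```python
from typing import Optional

# Unknown-value codes stated in the babies.txt key above.
# Other columns may have their own codes; they are not assumed here.
UNKNOWN_CODES = {"bwt": 999, "smoke": 9}


def clean_record(record: dict[str, int]) -> dict[str, Optional[int]]:
    """Replace documented 'unknown' sentinel codes with None."""
    return {
        name: (None if UNKNOWN_CODES.get(name) == value else value)
        for name, value in record.items()
    }


# Synthetic record for illustration only.
row = {"bwt": 999, "gestation": 284, "parity": 0, "smoke": 9}
cleaned = clean_record(row)
print(cleaned)
```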
Context We have a dataset of granted loans and their arrears rates, which allows supervised machine learning. Content This is a dataset of granted loans with their default rates.
Variation of hospital charges across hospitals in the US for the top 100 diagnoses. The dataset is owned by the US government. It is freely available on [data.gov](https://data.gov). The dataset is updated periodically [here](https://data.cms.gov/Medicare/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3). This dataset shows how the price for the same diagnosis and the same treatment in the same city can vary across providers. It might help you or your loved one find a better hospital for your treatment. You can also analyze it to detect fraud among providers.
Dataset used for the analyses
This dataset is a simplified version of the FocaLens dataset. Since the online space is limited, we had to resize the images to 480x320, which is the input image size of our proposed model. The full-size dataset will be published soon.
Context Medium is one of the most famous tools for spreading knowledge about almost any field. It is widely used to publish articles on ML, AI, and data science. This dataset is a collection of about 350 articles in such fields. Content The dataset contains articles, their titles, the number of claps they have received, their links and their reading times. Acknowledgements This dataset was scraped from [Medium](https://medium.com/). I created a Python script to scrape all the required articles using just their tags from Medium. Check out the script [here](https://github.com/Hsankesara/medium-scrapper) Inspiration How to write a good article? How to inform the reader in an interesting way? What sort of title attracts a larger crowd? How long should an article be?
Context Find the best strategies to improve for the next marketing campaign. How can the financial institution have a greater effectiveness for future marketing campaigns? In order to answer this, we have to analyze the last marketing campaign the bank performed and identify the patterns that will help us find conclusions in order to develop future strategies. Source [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
Context [Emergent.info](http://www.emergent.info/) was a major rumor tracker, created by veteran journalist [Craig Silverman](https://twitter.com/CraigSilverman). It has been defunct for a while, but its well-structured format and well-documented content provide an opportunity for analyzing rumors on the web. [Snopes.com](http://www.snopes.com/) is one of the oldest rumor trackers on the web. Originally launched by Barbara and David Mikkelson, it is now run by a team of editors who investigate urban legends, myths, viral rumors and fake news. The investigators try to provide a detailed explanation for why they have chosen to confirm or debunk a rumor, often citing several web pages and other external sources. [Politifact.com](http://www.politifact.com/) is a fact-checker that is focused on statements made by politicians and claims circulated by political campaigns, blogs and similar websites. Politifact's labels range from "true," to "pants on fire!" --- Content This dataset consists of three files. One file is a collection of all webpages cited in Emergent.info, and the second is a collection of webpages cited in Snopes.com, and the third is a similar collection from Politifact.com. The webpages were often cited because they had started a rumor, shared a rumor, or debunked a rumor. Emergent.info Emergent.info often provides a clean timeline of the rumor's propagation on the web, and identifies which page was for the rumor, which page was against it, and which page was simply observing it. Please refer to the image below to learn more about the fields in this dataset. ![The image displays a sample post from Emergent.info and highlights the corresponding fields in emergent.csv.][1] Snopes.com The structure of posts on **Snopes.com** is not as well-defined. Please refer to the image below to learn more about the fields in the Snopes dataset. 
![This image displays a sample post from Snopes.com and highlights the corresponding fields in snopes.csv.][2] Politifact.com Similar to Emergent.info, Politifact.com follows a well-structured format in reporting and documenting rumors. There is a sidebar on the right side of each page that lists all of the sources cited within the page. The top link is the likeliest to be the original source of the rumor. For this link, page_is_first_citation is set to true. ![This image displays a sample post from Politifact.com and highlights the corresponding fields in politifact.csv.][3] --- Inspiration I created this dataset in order to study domains that frequently start, propagate, or debunk rumors. By studying these domains and people who follow them, I hope to gain some insight into the dynamics of rumor propagation on the web, as well as social media. --- Notes/Disclaimer When using the Snopes dataset, please keep the following in mind: * In addition to debunking rumors, Snopes.com occasionally reports news and other types of content. This collection only includes data from "[Fact Check](http://www.snopes.com/category/facts)" posts on Snopes. * Snopes.com was launched years ago. Some of the older posts on the website do not follow the current format of the site, therefore some of the fields might be missing. * Snopes.com used to use a service named "[DoNotLink.com](https://twitter.com/donotlink?lang=en)" for citation purposes. That service is no longer active and as a result some of the links are missing from older posts on Snopes. * In addition, some of the shortened links would time-out prior to resolution, in which case they would not be added to the dataset. * Occasionally, a website that has been cited has not maliciously started a rumor. For instance, Andy Borowitz is a humorist who writes for *The New Yorker*. 
His satirical column is sometimes mistaken for real news; as a result, *The New Yorker* may be cited as a source of fake news on [Snopes.com](http://www.snopes.com/trump-blasts-media-for-reporting-things-he-says/). This does not mean that *The New Yorker* is a fake news website. When using the Politifact dataset, please keep the following in mind: * The data included in this dataset are collected from the "[truth-o-meter](http://www.politifact.com/punditfact/statements/)" page of Politifact.com. * Politifact often fact-checks statements made by politicians. Since this dataset is focused on websites, I have ignored all the posts in which the rumor was attributed to a person, a political party, a campaign, or an organization. Instead, I have only included rumors attributed explicitly to websites or blogs. --- Useful Tips for Using the Snopes collection As opposed to the Emergent collection where each page is flagged with whether it was for or against a rumor, no such information is available for the Snopes dataset. To avoid manually labeling the data, you may use the following heuristics to identify which page started a rumor: * Webpages that are cited in the "Examples" section of a post are often "observing" the rumor, i.e. they have not started it, but they are repeating it. In the snopes.csv file, these webpages have been flagged as "page_is_example." * Webpages that are cited in the "Featured Image" section of a post are often not related to the rumor. The editors on Snopes have simply extracted an image from those pages to embed in their posts. In the snopes.csv file, these webpages have been flagged as "page_is_image_credit." * Webpages that are cited through a secondary service (such as [archive.is](http://archive.is/)) are likelier to be rumor-propagators. Editors do not link to them directly so that a record of their page is available, even if it is later deleted. 
* If none of these hints helps, very often (but not always) the first link cited on the page (for which "page_is_example" and "page_is_image_credit" are false) is the link to a page that started the rumor. This link is identified by the "page_is_first_citation" field. Pages for which both "page_is_first_citation" and "page_is_archived" are true are very likely to be rumor propagators. * To identify satirical websites that are mistaken for real news, it's useful to inspect the way they are cited on Snopes. To demonstrate that a website contains satire or humor, Snopes writers often cite the "about us" page of the site. Therefore it's useful to see which domains often contain a URI to their "about" page (e.g. "http://politicops.com/about-us/"). [1]: http://imgur.com/JZPExar.png [2]: http://i.imgur.com/jFT6Vdb.png [3]: http://i.imgur.com/Z83JP7c.png
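The Snopes heuristics listed above can be combined into a rough first-pass labeler. The sketch below assumes the four flag fields have already been parsed into booleans; the label strings and the function name are hypothetical, and the output is a heuristic guess, not ground truth.

```python
def classify_citation(row: dict[str, bool]) -> str:
    """Apply the documented heuristics to one snopes.csv row.

    Assumption: flag fields are already Python booleans.
    Labels are illustrative, not part of the dataset.
    """
    if row.get("page_is_example", False):
        return "observer"  # repeats the rumor, did not start it
    if row.get("page_is_image_credit", False):
        return "image_credit"  # usually unrelated to the rumor
    first = row.get("page_is_first_citation", False)
    archived = row.get("page_is_archived", False)
    if first and archived:
        return "very_likely_propagator"
    if first or archived:
        return "possible_source"
    return "unknown"


# Synthetic rows for illustration.
rows = [
    {"page_is_example": True, "page_is_first_citation": True},
    {"page_is_first_citation": True, "page_is_archived": True},
    {"page_is_archived": True},
    {},
]
labels = [classify_citation(r) for r in rows]
print(labels)
```

The example and image-credit checks run first because the description defines the first-citation hint only for pages where both of those flags are false.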
History I have made the database of photos sorted by products and brands. Screenshots were performed only on official brand websites. Content The main dataset (style.zip) is 2184 color images (150x150x3) with 7 brands and 10 products, and the file with labels `style.csv`. Photo files are in the `.png` format and the labels are integers and values. The file `StyleColorImages.h5` consists of preprocessing images of this set: image tensors and targets (labels). Acknowledgements I have published the data for absolutely free using by any site visitor. But this database contains the names of famous brands, so it can not be used for commercial purposes. Usage Classification, image recognition and colorizing, etc. in a case of a small number of images are useful exercises. The main question we can try to answer with the help of the data is whether the algorithms can recognize the unique design style well enough. To facilitate the task, I chose the most easily recognizable brands with a bright style. [The example of usage](https://github.com/OlgaBelitskaya/deep_learning_projects/blob/master/DL_PP4/DL_PP4_Solutions.ipynb) Improvement There are lots of ways for improving this set and the machine learning algorithms applying to it. At first, it needs to increase the number of photos.
The dataset contains 150 normal and 134 nodule images.
This isn't a dataset, it is a collection of kernels written on Kaggle that use no data at all.
Context There's a story behind every dataset and here's your opportunity to share yours. Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?
This dataset provides 8-day global gross primary production (GPP) at 0.05° latitude by 0.05° longitude for 1982-2017. (1) Model description and performance The GPP dataset was generated by the revised EC-LUE model by integrating the regulations of several major environmental variables: atmospheric CO2 concentration, radiation components (i.e., direct and diffuse radiation), and atmospheric vapor pressure deficit (VPD). The revised EC-LUE performed well in simulating the spatial, seasonal, and interannual variations in global GPP. In particular, it reproduces the interannual variations in GPP well at both site and global scales. (2) Dataset information Each .zip file contains all the 8-day GPP values of one year, expressed as daily values. To obtain the sum over each 8-day (or 5-day or 6-day) period, multiply the GPP value by the corresponding number of days (8 for the first 45 values, and 5 or 6 for the last value). Data format: HDF Spatial extent: 90S-90N 180W-180E Fill value: 65535 Scale factor: 0.01 Unit: g C m<sup>-2</sup> day<sup>-1</sup> Questions about the GPP dataset can be directed to yuanwpcn@126.com (Wenping Yuan).
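The scaling and period-sum rule described above can be worked through for a single pixel. This sketch assumes 46 stored values per year and that the final period has 6 days in leap years and 5 otherwise (inferred from 45 × 8 = 360 days, not stated explicitly); the function names are hypothetical and no HDF reading is shown.

```python
import calendar
from typing import Optional

FILL_VALUE = 65535   # stated fill value
SCALE_FACTOR = 0.01  # stated scale factor; result in g C m^-2 day^-1


def period_lengths(year: int) -> list[int]:
    """Days per period: 45 periods of 8 days plus a short final period."""
    # Assumption: the last period covers the remaining 5 (or 6) days.
    last = 6 if calendar.isleap(year) else 5
    return [8] * 45 + [last]


def annual_gpp(raw_values: list[int], year: int) -> Optional[float]:
    """Annual GPP (g C m^-2) for one pixel, or None if any value is fill."""
    if len(raw_values) != 46:
        raise ValueError("expected 46 values per year")
    if FILL_VALUE in raw_values:
        return None
    return sum(
        raw * SCALE_FACTOR * days
        for raw, days in zip(raw_values, period_lengths(year))
    )


# Synthetic pixel: a constant stored value of 100, i.e. 1.0 g C m^-2 day^-1.
total = annual_gpp([100] * 46, 2001)
print(total)
```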
This file contains GIS information on the aggregated distribution of ecosystem services (ES) over the study area (French Alps). Sixteen ES were included as binary datasets (presence/absence - threshold at third quartile) to calculate the number of present ES at a resolution of 1*1km.
This presentation, part of the RDS-in-Flight series, explores one of the prime motivators for sharing data openly, namely the prospect for its reuse by other researchers. Various open data platforms store, describe and cite their datasets differently, with a range of practices around providing data citations, licensing practices, and the use of metadata schema. This presentation, aimed at librarians, focuses on various open data repositories, aggregators, and data tools and how they can be used to find, gather, and cite open data for future studies, and how to search them for open data sets relevant to specific disciplines. The presentation also describes how to create a collection on ZivaHub, UCT's institutional data repository, and how to link open datasets to theses/dissertations or other kinds of research outputs using a data availability statement.
This task is designed to study the impact of domain expertise on query formulation. The crowd workers, divided into two types (expert and novice in the medical domain), were asked to build an appropriate query for achieving the search task using a pair of facets: (1) search context and (2) information need. A total of 113 topics (from 2 datasets, CLEFeHealth and OHSUMED) were submitted to 3 experts and 3 novices for self-query generation; therefore 6 queries were formulated for each topic. More details of this task can be found in our paper in the references. Fields of the csv file: <strong>dataset</strong>: <em>CLEF_eHealth</em> or <em>OHSUMED</em> <strong>topic_id</strong>: ID of the topic (from 1 to 50 for CLEF_eHealth, from 50 to 113 for OHSUMED) <strong>answerer_type</strong>: <em>expert</em> or <em>novice</em> <strong>answerer_id</strong>: ID of crowd worker <strong>composed_query</strong>: query formulated by the crowd worker <strong>information_need</strong>: the clues about the desired content of relevant documents <strong>task_context</strong>: the medical case that triggers the information need
Dataset of the chromosome number polymorphism in Asteraceae family.
Context ISCO is a tool for organizing jobs into a clearly defined set of groups according to the tasks and duties undertaken in the job. Content Occupational codes broken down by Major, Sub-Major, Minor and Unit groups. Acknowledgements The International Labour Organization - [http://www.ilo.org/][1] Inspiration A simple breakdown of the occupational codes provided by the ILO. Country specific information to be added soon. [1]: http://www.ilo.org/
Data description: This product provides annual (1985-2015) 30-m vegetation phenology (i.e., start of season-SOS; end of season-EOS) in urban areas of the conterminous United States, including: (1) Information about urban clusters *** uCluster_USA_gt500.zip (format: ESRI shapefile): bounding box of each urban cluster with the cluster ID and cityName. The ESRI file can be opened by many open-source software packages (e.g., QGIS) *** US_uCluster_UrbanRuralExtents.zip: spatial extent of urban clusters (‘US_uCluster_label.tif’) and urban and surrounding rural clusters (‘US_uCluster_label_withRural.tif’). (2) Phenology dataset Each zip file includes three phenology datasets: *** annual SOS from 1985-2015 for each urban cluster *** annual EOS from 1985-2015 for each urban cluster *** COR: correlation of fitted double logistic curve to the observed EVIS for each pixel. The COR should be divided by 10000, and it serves as an uncertainty layer to indicate the fitting performance of the double logistic model. Questions about this data can be directed to Prof. Yuyu Zhou (zhouyuyu@gmail.com)
We introduce the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. We describe its creation process, its content, and its possible uses. Attractive features of the Million Song Database include the range of existing resources to which it is linked, and the fact that it is the largest current research dataset in our field. As an illustration, we present year prediction as an example application, a task that has, until now, been difficult to study owing to the absence of a large set of suitable data. We show positive results on year prediction, and discuss more generally the future development of the dataset.
Despite the fact that extensive lists of linked open datasets are available in catalogues, most data publishers still connect their datasets to the most popular ones, such as DBpedia, Freebase and Geonames. Although the linkage with popular datasets would allow us to explore external resources, it would fail to cover highly specialized information. Catalogues of linked data describe the content of datasets in terms of the update periodicity, authors, SPARQL endpoints, linksets, amongst others, as recommended by the W3C VoID Vocabulary. However, catalogues by themselves do not provide any explicit information to help the URI linkage process. Searching techniques can rank available datasets according to the likelihood that it will be possible to find links between them and a given target dataset, so that most of the links, if not all, could be found by inspecting the most relevant datasets in the ranking. This dataset contains dataset descriptions using the VoID vocabulary for supporting the evaluation of searching techniques. The descriptions of each dataset include their linksets, classes, properties and topic categories, which were harvested from the Datahub catalogue, dataset dumps, VoID files and the DBpedia knowledge graph. DBpedia Spotlight allowed the detection of named entities in textual literals and thereafter of the DBpedia topic categories of each entity, which were taken as the topic categories of the datasets containing the entities.
Context This data represents a minutia in a 16x16 image, flattened into a 256-element vector.
Arabic Handwritten Characters Dataset Abstract Handwritten Arabic character recognition systems face several challenges, including the unlimited variation in human handwriting and large public databases. In this work, we model a deep learning architecture that can be effectively applied to recognizing Arabic handwritten characters. A Convolutional Neural Network (CNN) is a special type of feed-forward multilayer network trained in supervised mode. The CNN was trained and tested on our database, which contains 16,800 handwritten Arabic characters. In this paper, optimization methods are implemented to increase the performance of the CNN. Common machine learning methods usually apply a combination of a feature extractor and a trainable classifier. The use of a CNN leads to significant improvements across different machine-learning classification algorithms. Our proposed CNN gives an average 5.1% misclassification error on testing data. Context The motivation of this study is to use cross knowledge learned from multiple works to enhance the performance of Arabic handwritten character recognition. Arabic handwritten character recognition must also cope with different handwriting styles, making it important to find and work on new and advanced solutions for handwriting recognition. A deep learning system needs a huge amount of data (images) to be able to make good decisions. Content The dataset is composed of **16,800** characters written by 60 participants; the age range is 19 to 40 years, and 90% of participants are right-handed. Each participant wrote each character (from ’alef’ to ’yeh’) ten times on two forms as shown in Fig. 7(a) & 7(b). The forms were scanned at a resolution of 300 dpi. Each block is segmented automatically using Matlab 2016a to determine the coordinates of each block. The database is partitioned into two sets: a training set (13,440 characters, 480 images per class) and a test set (3,360 characters, 120 images per class). 
Writers of the training set and the test set are mutually exclusive. The order in which writers were added to the test set was randomized to make sure that writers of the test set are not from a single institution (to ensure variability of the test set). In an experimental section we showed that the results were promising, with a **94.9%** classification accuracy rate on testing images. In future work, we plan to work on improving the performance of handwritten Arabic character recognition. Acknowledgements Ahmed El-Sawy, **Mohamed Loey**, Hazem EL-Bakry, **Arabic Handwritten Characters Recognition using Convolutional Neural Network**, WSEAS, 2017 Our proposed CNN gives an average **5.1%** misclassification error on testing data. Inspiration Creating the proposed database presents more challenges because it deals with many issues such as style of writing, thickness, and the number and position of dots. Some characters have different shapes while written in the same position. For example, the teh character has different shapes in the isolated position. Benha University http://bu.edu.eg/staff/mloey https://mloey.github.io/
The dataset was taken from the HackerEarth deep learning competition: [competition page][1] [1]: https://www.hackerearth.com/challenge/competitive/deep-learning-beginner-challenge/machine-learning/predict-the-energy-used-612632a9-9de79188
The data was collected in October and November 2017.
Copyright information: Taken from "WholePathwayScope: a comprehensive pathway-based analysis tool for high-throughput data", BMC Bioinformatics 2006;7():30-30. Published online 19 Jan 2006. PMCID: PMC1388242. Copyright © 2006 Yi et al; licensee BioMed Central Ltd. The PSCP file for the "Cholesterol Synthesis" pathway, analyzed for the data from 11 different microarray datasets or CRI files representing a time course experiment (see for description of material and data preparation). Pooled hepatic mRNA was isolated from female wild type mice sacrificed at the indicated time points during fetal and post-natal development. The time point at 9 days before birth was used as the reference level of mRNA. "Day-5" and "Day-3" indicate 5 days and 3 days prior to birth, respectively.
This description is still being edited. Context Malicious websites are a great concern because it is difficult to analyze them one by one and to index each URL in a blacklist. Unfortunately, there is a lack of datasets with malicious and benign web characteristics. This dataset is a research product of my bachelor students and aims to fill this gap. *This is the first version of our dataset, obtained from our web security project; we are working to improve its results* Content The project consisted of evaluating different classification models to predict malicious and benign websites, based on application layer and network characteristics. The data were obtained by using different verified sources of benign and malicious URLs in a low-interaction client honeypot to isolate network traffic. We used additional tools to get other information, such as the server country with Whois. This is the first version and we have some initial results from applying machine learning classifiers in a bachelor thesis. Further details on the data processing and the data description can be found in the article below. URL Dataset This is an important topic and one of the most difficult things to process. Following other articles and another open resource, we used three blacklists: + machinelearning.inginf.units.it/data-andtools/hidden-fraudulent-urls-dataset + malwaredomainlist.com + zeuztacker.abuse.ch From them we got around 185181 URLs. We assumed that they were malicious according to the information from these sources; as a next research step we recommend verifying them through another security tool, such as VirusTotal. We got the benign URLs (345000) from https://github.com/faizann24/Using-machinelearning-to-detect-malicious-URLs.git; as in the previous step, a verification process through other security systems is also recommended. 
Framework First we wrote different Python scripts in order to systematically analyze and generate the information for each URL (**During the next months we will release them to the open source community on GitHub**). We then verified that each URL was available using Python libraries (such as request); we started with around 530181 samples, but as a result of this step the samples were filtered and we got 63191 URLs. ![Framework to detect malicious websites][1] Feature generator: During the research process we found that one way to study a malicious website is to analyze features from its application layer and network layer; to obtain them, the idea is to apply dynamic and static analysis. For the dynamic analysis, some articles used high-interaction web application honeypots, but these resources have not been updated in recent months, so some important vulnerabilities may not have been mapped. Data Description + URL: the anonymous identifier of the URL analyzed in the study + URL_LENGTH: the number of characters in the URL + NUMBER_SPECIAL_CHARACTERS: the number of special characters identified in the URL, such as “/”, “%”, “”, “&”, “.”, “=” + CHARSET: a categorical value giving the character encoding standard (also called character set). + SERVER: a categorical value giving the operating system of the server, obtained from the packet response. + CONTENT_LENGTH: the content size given in the HTTP header. + WHOIS_COUNTRY: a categorical variable whose values are the countries we got from the server response (specifically, our script used the Whois API). + WHOIS_STATEPRO: a categorical variable whose values are the states we got from the server response (specifically, our script used the Whois API). 
+ WHOIS_REGDATE: Whois provides the server registration date, so this variable has date values in the format DD/MM/YYYY HH:MM + WHOIS_UPDATED_DATE: the last update date of the analyzed server, obtained through Whois + TCP_CONVERSATION_EXCHANGE: the number of TCP packets exchanged between the server and our honeypot client + DIST_REMOTE_TCP_PORT: the number of ports detected that are different from TCP + REMOTE_IPS: the total number of IPs connected to the honeypot + APP_BYTES: the number of bytes transferred + SOURCE_APP_PACKETS: packets sent from the honeypot to the server + REMOTE_APP_PACKETS: packets received from the server + APP_PACKETS: the total number of IP packets generated during the communication between the honeypot and the server + DNS_QUERY_TIMES: the number of DNS packets generated during the communication between the honeypot and the server + TYPE: a categorical variable whose values represent the type of web page analyzed; specifically, 1 is for malicious websites and 0 is for benign websites Conclusions and future works Acknowledgements If your papers or other works use our dataset, please cite our paper: Urcuqui, C., Navarro, A., Osorio, J., & García, M. (2017). Machine Learning Classifiers to Detect Malicious Websites. CEUR Workshop Proceedings. Vol 1950, 14-17. If you need a review article on the state of the art of website cybersecurity (in English and Spanish): Urcuqui, C., Peña, M. G., Quintero, J. L. O., & Cadavid, A. N. (2017). Antidefacement. Sistemas & Telemática, 14(39), 9-27 If you have any questions or feedback, please contact me: ccurcuqui@icesi.edu.co Thank you for your comments; your feedback is very important for our future work - deardle GitHub https://github.com/urcuqui/WhiteHat/tree/master/Research/Web%20security [1]: https://github.com/urcuqui/WhiteHat/blob/master/Research/Web%20security/frameworks/framework%20to%20detect%20malicious%20websites.jpg
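The two purely lexical features in the data description above, URL_LENGTH and NUMBER_SPECIAL_CHARACTERS, could be derived roughly as follows. This is a minimal sketch and not the authors' script: the function name `url_features` is hypothetical, and the set of special characters is an assumption limited to the characters the description names explicitly.

```python
# Assumption: only the characters named in the description count as "special";
# the authors' actual list may be longer.
SPECIAL_CHARACTERS = set("/%&.=")


def url_features(url: str) -> dict:
    """Return URL_LENGTH and NUMBER_SPECIAL_CHARACTERS for one URL."""
    return {
        "URL_LENGTH": len(url),
        "NUMBER_SPECIAL_CHARACTERS": sum(ch in SPECIAL_CHARACTERS for ch in url),
    }


# Made-up URL for illustration only.
features = url_features("http://example.com/a?b=1&c=2")
print(features)
```

The network-level columns (TCP_CONVERSATION_EXCHANGE, APP_BYTES and so on) come from captured honeypot traffic and cannot be reproduced from the URL string alone.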
Download the dataset from our project page: https://github.com/wsdream/wsdream-dataset
Non-redundant dataset of ncRNA from ovary, pituitary and hypothalamus in fasta format. (TXT 64614 kb)
Comprehensive hydrometeorological dataset collected at experimental farms in the University of Melbourne's Dookie Campus.
This dataset was scraped from the Goodreads website. It has the ratings of the first 99 users of the website. The Bookfeatures csv contains the features of the books read and rated by these users.
Transect dataset. This Excel document lists all transects, their location, how much they were walked, and the number of flocks seen on them. The number of flocks includes incomplete flocks and is divided into Orange-billed Babbler (OBBA) led flocks and all other flocks, which are DISTANCE-adjusted to determine flock density.
Fungi spectral dataset in MATLAB format. Please cite: - Costa FSL, Silva PP, Morais CLM, et al. Attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy as a new technology for discrimination between Cryptococcus neoformans and Cryptococcus gattii. Anal Methods 2016; 8: 7107–7115. - Morais CLM, Costa FSL, Lima KMG. Variable selection with a support vector machine for discriminating Cryptococcus fungal species based on ATR FTIR spectroscopy. Anal Methods 2017; 9: 2964–2970.
Context Is the movie industry dying? Is Netflix the new entertainment king? Those were the first questions that led me to create a dataset focused on movie revenue and analyze it over the last decades. But why stop there? There are more factors that intervene in this kind of thing, like actors, genres, user ratings and more. And now, anyone with experience (you) can ask specific questions about the movie industry, and get answers. Content There are 6820 movies in the dataset (220 movies per year, 1986-2016). Each movie has the following attributes: - budget: the budget of a movie. Some movies don't have this, so it appears as 0 - company: the production company - country: country of origin - director: the director - genre: main genre of the movie. - gross: revenue of the movie - name: name of the movie - rating: rating of the movie (R, PG, etc.) - released: release date (YYYY-MM-DD) - runtime: duration of the movie - score: IMDb user rating - votes: number of user votes - star: main actor/actress - writer: writer of the movie - year: year of release Acknowledgements This data was scraped from IMDb. Contribute You can contribute via [GitHub](https://github.com/Juanets/movie-stats).
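Because a missing budget is stored as 0 in the movie dataset above, it is worth converting those zeros to an explicit missing value before computing averages or ratios. The sketch below uses only the standard library; the column names come from the attribute list, while the two sample rows and the helper name `load_movies` are made up for illustration.

```python
import csv
import io

# Hypothetical two-row sample using column names from the description;
# the values are made up for illustration.
SAMPLE = """name,budget,gross,year
Movie A,0,1000000,1990
Movie B,5000000,20000000,1991
"""


def load_movies(handle):
    """Read rows and turn a budget of 0 into None (missing)."""
    movies = []
    for row in csv.DictReader(handle):
        budget = float(row["budget"])
        row["budget"] = budget if budget > 0 else None
        row["gross"] = float(row["gross"])
        movies.append(row)
    return movies


movies = load_movies(io.StringIO(SAMPLE))
# Keep only rows where a budget is actually known.
with_budget = [m for m in movies if m["budget"] is not None]
print(len(movies), len(with_budget))
```

With the real file, the same function can be applied to an open file handle instead of the in-memory sample.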
A large scale dataset for complex Question Answering.
This is the primary and secondary event record file of the LSDO dataset. Please see the reference link to get access to the full dataset with solar images.
Talk given during the "Harmonise This! Analyzing Diverse Neuroimaging Datasets" workshop at the 2015 Organization for Human Brain Mapping (OHBM) conference in Hawaii, 14-18 June.
NEXUS file containing the 'concatenated mtDNA' dataset alignment, which includes 171 mtDNA subsamples sequenced for the mitochondrial cytochrome b and/or cytochrome oxidase 1 genes.
The below information is from the project page: https://nlp.stanford.edu/projects/glove/ Context GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. Content Due to size constraints, only the 25 dimension version is uploaded. Please visit the project page for GloVe of other dimensions. This dataset (https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation) contains GloVe extracted from Wikipedia 2014 + Gigaword 5. 1. Nearest neighbors The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. Sometimes, the nearest neighbors according to this metric reveal rare but relevant words that lie outside an average human's vocabulary. 2. Linear substructures The similarity metrics used for nearest neighbor evaluations produce a single scalar that quantifies the relatedness of two words. This simplicity can be problematic since two given words almost always exhibit more intricate relationships than can be captured by a single number. For example, man may be regarded as similar to woman in that both words describe human beings; on the other hand, the two words are often considered opposites since they highlight a primary axis along which humans differ from one another. In order to capture in a quantitative way the nuance necessary to distinguish man from woman, it is necessary for a model to associate more than a single number to the word pair. A natural and simple candidate for an enlarged set of discriminative numbers is the vector difference between the two word vectors. 
GloVe is designed in order that such vector differences capture as much as possible the meaning specified by the juxtaposition of two words. Acknowledgements Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Inspiration The dataset specifically includes tokens extracted from Twitter, which unlike tokens from Wikipedia, include many abbreviations that have interesting content.
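The two ideas in the GloVe description, nearest neighbors by Euclidean distance or cosine similarity and vector differences between word pairs, can be sketched in plain Python. The example assumes the plain-text layout of one word followed by its space-separated components per line; the three-dimensional vectors are toy values made up for illustration, not real GloVe embeddings, and all function names are hypothetical.

```python
import math


def parse_glove_line(line):
    """Split one 'word v1 v2 ...' line into (word, vector)."""
    word, *values = line.rstrip().split(" ")
    return word, [float(v) for v in values]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def difference(a, b):
    """Component-wise difference a - b, the 'linear substructure' vector."""
    return [x - y for x, y in zip(a, b)]


# Toy vectors, made up for illustration.
lines = ["man 1.0 0.0 1.0", "woman 1.0 1.0 1.0"]
vectors = dict(parse_glove_line(line) for line in lines)

similarity = cosine_similarity(vectors["man"], vectors["woman"])
distance = euclidean_distance(vectors["man"], vectors["woman"])
offset = difference(vectors["woman"], vectors["man"])
print(similarity, distance, offset)
```

The scalar measures summarize relatedness in one number, while `offset` keeps one number per dimension, which is the richer representation the description argues for.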
Resolution: 512*512*230; data type: unsigned short; 0.3 mm resolution. This dataset was obtained using the dental CBCT imaging system ZCB100 (Shenzhen ZhongKe TianYue Technology Co., Ltd.) at 110 kV and 10 mAs, with a 15.36-cm FOV.
The file contains a correlation analysis of the skill requirements for software testers. The dataset comes from 400 job advertisements. We use the file to look for correlated skills, in our quest to find whether preset profiles of software testers emerge from the demands formulated by employers at hiring.
This dataset presents the first global fuel map, containing all the parameters required as input to the Fuel Characteristic Classification System (FCCS). The dataset was developed from different spatial variables, based on both satellite Earth observation products and fuel databases, and comprises a global fuelbed map and a database that includes the parameters of each fuelbed that affect fire behavior and effects. A total of 274 fuelbeds were created and parameterized, and can be input into FCCS to obtain fire potentials, surface fire behavior and carbon biomass for each fuelbed. The global fuel dataset can be used for a varied range of applications, including fire danger assessment, fire behavior estimations, fuel consumption calculations and emissions inventories.
Description The dataset here is a sample of the transactions made in a retail store. The store wants to better understand customer purchase behaviour across different products. Specifically, the problem here is a regression problem where we are trying to predict the dependent variable (the amount of purchase) with the help of the information contained in the other variables. Classification problems can also be posed on this dataset since several variables are categorical; other approaches could be "Predicting the age of the consumer" or even "Predict the category of goods bought". This dataset is also particularly convenient for clustering, for example to find different clusters of consumers within it. Acknowledgements The dataset comes from a competition hosted by Analytics Vidhya.
This failure dataset contains the injected faults, the workload, the effects of failure (both the user-side impact and our own in-depth correctness checks), and the error logs produced by the OpenStack cloud management system. Please refer to the paper "Empirical analysis of software failures in the OpenStack cloud computing platform" (ESEC/FSE '19).
Context The key to success in any organization is attracting and retaining top talent. I’m an HR analyst at my company, and one of my tasks is to determine which factors keep employees at my company and which prompt others to leave. I need to know what factors I can change to prevent the loss of good people. Watson Analytics is going to help. Content I have data about past and current employees in a spreadsheet on my desktop. It has various data points on our employees, but I’m most interested in whether they’re still with my company or whether they’ve gone to work somewhere else. And I want to understand how this relates to workforce attrition. **Education** 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor' **EnvironmentSatisfaction** 1 'Low' 2 'Medium' 3 'High' 4 'Very High' **JobInvolvement** 1 'Low' 2 'Medium' 3 'High' 4 'Very High' **JobSatisfaction** 1 'Low' 2 'Medium' 3 'High' 4 'Very High' **PerformanceRating** 1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding' **RelationshipSatisfaction** 1 'Low' 2 'Medium' 3 'High' 4 'Very High' **WorkLifeBalance** 1 'Bad' 2 'Good' 3 'Better' 4 'Best' Acknowledgements https://www.ibm.com/communities/analytics/watson-analytics-blog/watson-analytics-use-case-for-hr-retaining-valuable-employees/ Inspiration Which factors led to employee attrition?
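The coded columns in the HR attrition description are easier to read once the integers are mapped back to their labels. The lookup tables below are copied from the description; the helper name `decode`, the sample record and its `Age` field are hypothetical and only illustrate the idea.

```python
# Code-to-label tables copied from the dataset description.
FOUR_LEVEL = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}
LABELS = {
    "Education": {1: "Below College", 2: "College", 3: "Bachelor",
                  4: "Master", 5: "Doctor"},
    "EnvironmentSatisfaction": FOUR_LEVEL,
    "JobInvolvement": FOUR_LEVEL,
    "JobSatisfaction": FOUR_LEVEL,
    "PerformanceRating": {1: "Low", 2: "Good", 3: "Excellent", 4: "Outstanding"},
    "RelationshipSatisfaction": FOUR_LEVEL,
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
}


def decode(record):
    """Replace coded values with labels; leave other fields untouched."""
    return {
        key: LABELS[key].get(value, value) if key in LABELS else value
        for key, value in record.items()
    }


# Hypothetical record; "Age" stands in for any column without a code table.
example = decode({"Education": 3, "WorkLifeBalance": 1, "Age": 41})
print(example)
```

Keeping the original integer columns alongside the decoded ones is usually convenient, since the codes are ordinal and can still be used directly in models.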
EmBeD, <i>Energy-based anomaly detector in the cloud</i>, is an approach to detect anomalies at runtime based on the free energy of a Restricted Boltzmann Machine (RBM) model. The free energy is a stochastic function that can be used to efficiently score anomalies for detecting outliers. EmBeD analyzes the system behavior from raw metric data, does not require extensive training with seeded faults, and classifies the relation of anomalous behaviors with future failures with very few false positives. The file data.zip contains the dataset used for validating <i>EmBeD</i>.
The objective of the BRFSS is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases in the adult population. Factors assessed by the BRFSS include tobacco use, health care coverage, HIV/AIDS knowledge or prevention, physical activity, and fruit and vegetable consumption. Data are collected from a random sample of adults (one per household) through a telephone survey. The Behavioral Risk Factor Surveillance System (BRFSS) is the nation's premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world. Content - Each year contains a few hundred columns. Please see one of the [annual code books][1] for complete details. - These CSV files were converted from a SAS data format using pandas; there may be some data artifacts as a result. - If you like this dataset, you might also like the data for 2001-2010. Acknowledgements This dataset was released by the CDC. You can find the original dataset and [additional years of data here][2]. [1]: https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf [2]: https://www.cdc.gov/brfss/annual_data/annual_data.htm
Context Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others." Zalando seeks to replace the original MNIST dataset. Content Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. The training and test data sets have 785 columns. The first column consists of the class labels (see the Labels list below) and represents the article of clothing. The rest of the columns contain the pixel-values of the associated image. To locate a pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27. The pixel is located on row i and column j of a 28 x 28 matrix. For example, pixel31 indicates the pixel that is in the fourth column from the left, and the second row from the top. Labels Each training and test example is assigned to one of the following labels: 0 T-shirt/top 1 Trouser 2 Pullover 3 Dress 4 Coat 5 Sandal 6 Shirt 7 Sneaker 8 Bag 9 Ankle boot TL;DR Each row is a separate image. Column 1 is the class label. 
Remaining columns are pixel numbers (784 total). Each value is the darkness of the pixel (0 to 255). Acknowledgements The original dataset was downloaded from https://github.com/zalandoresearch/fashion-mnist The dataset was converted to CSV with this script: https://pjreddie.com/projects/mnist-in-csv/ License The MIT License (MIT) Copyright © [2017] Zalando SE, https://tech.zalando.com Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
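The pixel indexing rule from the Fashion-MNIST entry, x = i * 28 + j, amounts to an integer division with remainder. The sketch below shows that decomposition and the matching reshape of one CSV row into a 28 x 28 grid; the function names are hypothetical, and the row used in the example is synthetic (pixel values equal to their own index) rather than a real image.

```python
IMAGE_SIDE = 28
PIXELS_PER_IMAGE = IMAGE_SIDE * IMAGE_SIDE  # 784


def pixel_position(x):
    """Return zero-based (row i, column j) for pixel index x = i * 28 + j."""
    return divmod(x, IMAGE_SIDE)


def row_to_image(row):
    """Split one CSV row (label, then 784 pixel values) into label and grid."""
    label, pixels = row[0], row[1:]
    if len(pixels) != PIXELS_PER_IMAGE:
        raise ValueError("expected 784 pixel values after the label")
    grid = [
        pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE]
        for r in range(IMAGE_SIDE)
    ]
    return label, grid


# Synthetic row: label 9 followed by pixel values 0..783.
label, grid = row_to_image([9] + list(range(PIXELS_PER_IMAGE)))
print(pixel_position(31), grid[1][3])
```

For pixel31 this gives row 1 and column 3 when counting from zero, which is the second row from the top and the fourth column from the left, as the description states.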
Context Simultaneous tracking of multiple people is still a very challenging computer vision problem. This is especially true for sports activities, for which people often wear similar uniforms, move quickly and erratically, and have close interactions with each other. This dataset is captured with thermal cameras, which enables easier segmentation and ensures privacy of people in public facilities, but at the same time we are left with no distinct appearance information to guide our tracking algorithms. Content This dataset contains four 30-second video sequences of eight people playing soccer in an indoor arena (court size 40*20 metres). The video is captured by thermal cameras of type AXIS Q1922 with a resolution of 640*480 pixels and 25 fps. The images from the three cameras are stitched into one image of 1920*480 pixels. The videos are manually annotated for tracking. Acknowledgements Gade, R. & Moeslund, T.B.: Constrained multi-target tracking for team sports activities. IPSJ Transactions on Computer Vision and Applications (2018) 10: 2. https://doi.org/10.1186/s41074-017-0038-z
Supplementary Data 7. Dataset and best tree obtained considering the 9 configurations. The script used to run the analysis in TNT (land_searches.run) is also included.
Context The no-show problem is one of the biggest in the health industry: about 30% of patients miss their appointments. Content 61K points, from 2017.01.01 to 2017.04.30, and 19 features to work with Data Dictionary 1. especialidad : the kind of specialist the patient is going to see, e.g. dermatologist, etc. 2. edad: age 3. sexo: sex, 1: Male, 2: Female 4. reserva_mes_d : discrete value for the month of the appointment, 1: Jan, 2: Feb... 5. reserva_mes_c : continuous value for the month of the appointment, the formula is COS(2*reserva_mes_d*Pi/12) 6. reserva_dia_d : day of the week for the appointment, 1: Mon... 7: Sun 7. reserva_dia_c : continuous value for the day of the week, the formula is COS(2*reserva_dia_d*Pi/7) 8. reserva_hora_d : discrete value for the hour of the appointment 9. reserva_hora_c : continuous value for the hour of the appointment, the formula is COS(2*reserva_hora_d*Pi/24) 10. creacion_mes_d : discrete value for the month when the appointment was created 11. creacion_mes_c : continuous value for the month when the appointment was created, the formula is COS(2*creacion_mes_d*Pi/12) 12. creacion_dia_d : same as reserva_dia_d, but considering the day when the appointment was created 13. creacion_dia_c : same as reserva_dia_c, but considering the day when the appointment was created 14. creacion_hora_d : hour when the appointment was created 15. creacion_hora_c : continuous value for creacion_hora_d, the formula is COS(2*creacion_hora_d*Pi/24) 16. latencia : number of days between the appointment and the date when it was created 17. canal : channel used for the creation of the appointment, 1: call center, 2: Personal, 3: Web 18. tipo : type of appointment, 1: medical, 2: procedures 19. show : 0: no show, 1: show Inspiration Can we use it to predict whether a patient is going to show up for their appointment?
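The continuous ("_c") columns in the data dictionary above all follow the same pattern, COS(2 * value * Pi / period), with a period of 12 for months, 7 for weekdays and 24 for hours. A minimal sketch of that transform is shown below; the function name `cyclic_cos` and the `PERIODS` table are hypothetical, and only the formula itself comes from the description.

```python
import math

# Periods implied by the formulas in the data dictionary.
PERIODS = {"mes": 12, "dia": 7, "hora": 24}


def cyclic_cos(value, period):
    """Cosine encoding of a cyclic value: cos(2 * value * pi / period)."""
    return math.cos(2 * value * math.pi / period)


# Example: reserva_mes_c for December (12), June (6), March (3), September (9).
december = cyclic_cos(12, PERIODS["mes"])
june = cyclic_cos(6, PERIODS["mes"])
march = cyclic_cos(3, PERIODS["mes"])
september = cyclic_cos(9, PERIODS["mes"])
print(december, june, march, september)
```

One thing to keep in mind when modelling: the cosine alone maps some distinct values to the same number (March and September both come out as approximately 0), so the discrete "_d" column, or an additional sine term, is needed to tell them apart.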
Acknowledgements The datasets were downloaded from mortality.org at http://www.mortality.org/cgi-bin/hmd/country.php?cntr=CHE&level=1
Context I am a beginner in the data science world and decided to work on Natural Language Processing questions, so I built my own dataset by collecting SMS spam. Content This dataset has 2 columns: Label (ham or spam) and Message, which is simply the full message. Acknowledgements I guess AirDroid; otherwise it would have been pretty tedious to type out everything.
The Lahman Baseball Database 2012 Version Release Date: December 31, 2012 ---------- 0.1 Copyright Notice & Limited Use License This database is copyright 1996-2013 by Sean Lahman. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see: http://creativecommons.org/licenses/by-sa/3.0/ For licensing information or further information, contact Sean Lahman at: seanlahman@gmail.com ---------------------------------------------------------------------- 0.2 Contact Information Web site: http://www.baseball1.com E-Mail : seanlahman@gmail.com If you're interested in contributing to the maintenance of this database or making suggestions for improvement, please consider joining our mailing list at: http://groups.yahoo.com/group/baseball-databank/ If you are interested in similar databases for other sports, please visit the Open Source Sports website at http://OpenSourceSports.com ---------------------------------------------------------------------- 1.1 Introduction This database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2012. It includes data from the two current leagues (American and National), the four other "major" leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875. This database was created by Sean Lahman, who pioneered the effort to make baseball statistics freely available to the general public. What started as a one-man effort in 1994 has grown tremendously, and now a team of researchers have collected their efforts to make this the largest and most accurate source for baseball statistics available anywhere. (See Acknowledgements below for a list of the key contributors to this project.) None of what we have done would have been possible without the pioneering work of Hy Turkin, S.C. Thompson, David Neft, and Pete Palmer (among others). 
All baseball fans owe a debt of gratitude to the people who have worked so hard to build the tremendous set of data that we have today. Our thanks also to the many members of the Society for American Baseball Research who have helped us over the years. We strongly urge you to support and join their efforts. Please visit their website (www.sabr.org). This database can never take the place of a good reference book like The Baseball Encyclopedia. But it will enable people to do the kind of queries and analysis that those traditional sources don't allow. If you have any problems or find any errors, please let us know. Any feedback is appreciated. ---------------------------------------------------------------------- 1.2 What's New in 2012 There has been significant cleanup in the master file. MLB's addition of wildcard games in 2012 adds two new types of records to the post-season files. The abbreviations ALWC and NLWC are used to denote each league's wild card game. Added the MLB "Comeback Player of the Year" award to the awards table. Florida Marlins changed their name to the Miami Marlins; new team abbr is MIA. ---------------------------------------------------------------------- 1.3 Acknowledgements Much of the raw data contained in this database comes from the work of Pete Palmer, the legendary statistician, who has had a hand in most of the baseball encyclopedias published since 1974. He is largely responsible for bringing the batting, pitching, and fielding data out of the dark ages and into the computer era. Without him, none of this would be possible. For more on Pete's work, please read his own account at: http://sabr.org/cmsfiles/PalmerDatabaseHistory.pdf Two people have been key contributors to the work that followed, first by taking the raw data and creating a relational database, and later by extending the database to make it more accessible to researchers. 
Sean Lahman launched the Baseball Archive's website back before most people had heard of the world wide web. Frustrated by the lack of sports data available, he led the effort to build a baseball database that everyone could use. Baseball researchers everywhere owe him a debt of gratitude. Lahman served as an associate editor for three editions of Total Baseball and contributed to five editions of The ESPN Baseball Encyclopedia. He has also been active in developing databases for other sports. The work of Sean Forman to create and maintain an online encyclopedia at "baseball-reference.com" has been remarkable. Recognized as the premier online reference source, Forman's site provides an outstanding interface to the raw data. His efforts to help streamline the database have been extremely helpful. Most importantly, Forman has spearheaded the effort to provide standards that enable several different baseball databases to be used together. He was also instrumental in launching the Baseball Databank, a forum for researchers to gather and share their work. Since 2001, these two Seans have led a group of researchers who volunteered to maintain and update the database. A handful of researchers have made substantial contributions to maintain this database in recent years. Listed alphabetically, they are: Derek Adair, Mike Crain, Kevin Johnson, Rod Nelson, Tom Tango, and Paul Wendt. These folks did much of the heavy lifting, and are largely responsible for the improvements made in the last decade. Others who made important contributions include: Dvd Avins, Clifford Blau, Bill Burgess, Clem Comly, Jeff Burk, Randy Cox, Mitch Dickerman, Paul DuBois, Mike Emeigh, F.X. Flinn, Bill Hickman, Jerry Hoffman, Dan Holmes, Micke Hovmoller, Peter Kreutzer, Danile Levine, Bruce Macleod, Ken Matinale, Michael Mavrogiannis, Cliff Otto, Alberto Perdomo, Dave Quinn, John Rickert, Tom Ruane, Theron Skyles, Hans Van Slootenm, Michael Westbay, and Rob Wood. 
Many other people have made significant contributions to the database over the years. The contribution of Tom Ruane's effort to the overall quality of the underlying data has been tremendous. His work at retrosheet.org integrates the yearly data with the day-by-day data, creating a reference source of startling depth. It is unlikely that any individual has contributed as much to the field of baseball research in the past five years as Ruane has. Sean Holtz helped with a major overhaul and redesign before the 2000 season. Keith Woolner was instrumental in helping turn a huge collection of stats into a relational database in the mid-1990s. Clifford Otto & Ted Nye also helped provide guidance to the early versions. Lee Sinnis, John Northey & Erik Greenwood helped supply key pieces of data. Many others have written in with corrections and suggestions that made each subsequent version even better than what preceded it. The work of the SABR Baseball Records Committee, led by Lyle Spatz, has been invaluable. So has the work of Bill Carle and the SABR Biographical Committee. David Vincent, keeper of the Home Run Log and other bits of hard to find info, has always been helpful. The recent addition of colleges to player bios is the result of much research by members of SABR's Collegiate Baseball committee. Salary data has been supplied by Doug Pappas, who passed away during the summer of 2004. He was the leading authority on many subjects, most significantly the financial history of Major League Baseball. We are grateful that he allowed us to include some of the data he compiled. His work has been continued by the SABR Business of Baseball committee. Thanks are also due to the staff at the National Baseball Library in Cooperstown who have been so helpful -- Tim Wiles, Jim Gates, Bruce Markusen, and the rest of the staff. A special debt of gratitude is owed to Dave Smith and the folks at Retrosheet. There is no other group working so hard to compile and share baseball data. 
Their website (www.retrosheet.org) will give you a taste of the wealth of information Dave and the gang have collected. The 2012 database benefited from the work of Ted Turocy and his Chadwick Baseball Bureau. For more details on his tools and services, visit: http://chadwick.sourceforge.net/doc/index.html Thanks to all contributors great and small. What you have created is a wonderful thing. 2.0 Data Tables The design follows these general principles. Each player is assigned a unique number (playerID). All of the information relating to that player is tagged with his playerID. The playerIDs are linked to names and birthdates in the MASTER table. The database comprises the following main tables: MASTER - player names, DOB, and biographical info; Batting - batting statistics; Pitching - pitching statistics; Fielding - fielding statistics. It is supplemented by these tables: AllStarFull - All-Star appearances; Hall of Fame - Hall of Fame voting data; Managers - managerial statistics; Teams - yearly stats and standings; BattingPost - post-season batting statistics; PitchingPost - post-season pitching statistics; TeamFranchises - franchise information; FieldingOF - outfield position data; FieldingPost - post-season fielding data; ManagersHalf - split season data for managers; TeamsHalf - split season data for teams; Salaries - player salary data; SeriesPost - post-season series information; AwardsManagers - awards won by managers; AwardsPlayers - awards won by players; AwardsShareManagers - award voting for manager awards; AwardsSharePlayers - award voting for player awards; Appearances; Schools; SchoolsPlayers.
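The playerID design described above means any statistics table can be joined back to the MASTER table to recover a player's name. The following is a minimal sketch of that idea using an in-memory SQLite database. The trimmed schema, the column names other than playerID, and the two sample players are assumptions made for illustration, not the real table layout or real data.

```python
import sqlite3

# Hypothetical, heavily trimmed schema: the real tables have many more columns,
# and every column name except playerID is an assumption for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Master  (playerID TEXT PRIMARY KEY, nameFirst TEXT, nameLast TEXT);
    CREATE TABLE Batting (playerID TEXT, yearID INTEGER, HR INTEGER);
""")

# Invented rows, used only to make the join visible.
conn.executemany("INSERT INTO Master VALUES (?, ?, ?)",
                 [("playa01", "Pat", "Example"), ("playb01", "Sam", "Sample")])
conn.executemany("INSERT INTO Batting VALUES (?, ?, ?)",
                 [("playa01", 1990, 12), ("playa01", 1991, 20), ("playb01", 1990, 5)])


def career_home_runs(connection):
    """Sum home runs per player by joining Batting to Master on playerID."""
    query = """
        SELECT m.nameFirst || ' ' || m.nameLast AS name, SUM(b.HR) AS hr
        FROM Batting AS b
        JOIN Master AS m ON m.playerID = b.playerID
        GROUP BY m.playerID
        ORDER BY hr DESC
    """
    return connection.execute(query).fetchall()


totals = career_home_runs(conn)
```

The same join pattern should apply to the Pitching, Fielding, and post-season tables, since all of them are tagged with playerID.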
Piezoelectric tensor data. Note on citations: If you found this dataset useful and would like to cite it in your work, please be sure to cite its original sources below rather than or in addition to this page. Dataset is described in the following: De Jong M, Chen W, Geerlings H, Asta M, Persson K (2015) A database to enable discovery and design of piezoelectric materials. Scientific Data 2: 150053. https://doi.org/10.1038/sdata.2015.53 Data adapted from JSON files available here: De Jong M, Chen W, Geerlings H, Asta M, Persson K (2015) Data from: A database to enable discovery and design of piezoelectric materials. Dryad Digital Repository. https://doi.org/10.5061/dryad.n63m4
Data analysis software and canonical datasets are the driving force behind many fields of empirical sciences. Despite being of paramount importance, those resources are most often not adequately cited. Although some can consider this a “social” problem, its roots are technical: users of those resources are often simply not aware of the underlying computational libraries and methods they have been using in their research projects. This in turn fosters inefficient practices that encourage the development of new projects instead of contributions to existing established ones. Some projects (e.g. FSL) facilitate citation of the utilized methods, but such efforts are not uniform, and the output is rarely in commonly used citation formats (e.g. BibTeX). DueCredit is a simple framework to embed information about publications or other references within the original code or dataset descriptors. References are automatically reported to the user whenever a given functionality or dataset is being used. DueCredit is currently available for Python, but we envision extending support to other frameworks (e.g., Matlab, R). Until DueCredit gets adopted natively by the projects, it provides the functionality to “inject” references for 3rd party modules. For the developer, DueCredit implements a decorator @due.dcite that links a method or class to a set of references, which can be specified through a DOI or BibTeX entry. The initial release of DueCredit (0.1.0) was implemented during the OHBM 2015 hackathon, uploaded to PyPI, and is freely available. DueCredit provides a concise API to associate a publication reference with any given module or function. DueCredit comes with simple demo code, which demonstrates its utility. DueCredit is in its early stages of development, but two days of team development at the OHBM hackathon were sufficient to establish a usable prototype implementation.
Since then, the code-base was further improved and multiple beta-releases followed, expanding the coverage of citable resources (e.g., within scipy, sklearn modules via injections and PyMVPA natively).
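To make the decorator idea concrete, here is a toy sketch of how a citation decorator could record a reference only when the decorated function is actually called. It is not DueCredit's implementation or API: the names `cite`, `report`, and `_used_references`, and the placeholder DOI, are hypothetical and exist only for this illustration.

```python
import functools

# Toy registry, NOT DueCredit's real implementation: it only illustrates how a
# decorator can attach a reference to a function and report it once used.
_used_references = {}


def cite(doi, description=""):
    """Hypothetical stand-in for a citation decorator such as @due.dcite."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Record the reference only when the decorated code actually runs.
            _used_references[func.__qualname__] = (doi, description)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@cite("10.0000/example-doi", description="Placeholder reference for a demo method")
def demean(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def report():
    """Return one line per function that was used, with its reference."""
    return [f"{name}: doi:{doi} ({desc})"
            for name, (doi, desc) in sorted(_used_references.items())]
```

Calling `report()` before any decorated function has run returns an empty list, which mirrors the point made above that references are reported for functionality that is actually used rather than merely imported.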
The file contains a correlation analysis of the skill requirements for software testers. The dataset comes from 400 job advertisements. We use the file to look for correlated skills, in our quest to find if there are preset profiles of the software testers emerging from the demands formulated by employers at hiring.
Background NLP is a hot topic currently! Team AI wants to leverage NLP research, and this is an attempt to let NLP researchers explore exciting insights from bilingual data. The “Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles” aims mainly at supporting research and development relevant to high-performance multilingual machine translation, information extraction, and other language processing technologies. Unique Features A precise and large-scale corpus containing about 500,000 pairs of manually-translated sentences. Can be exploited for research and development of high-performance multilingual machine translation, information extraction, and so on. The three-step translation process (primary translation -> secondary translation to improve fluency -> final check for technical terms) has been clearly recorded. Enables observation of how translations have been elaborated so it can be applied for uses such as research and development relevant to translation aids and error analysis of human translation. Translated articles concern Kyoto and other topics such as traditional Japanese culture, religion, and history. Can also be utilized for tourist information translation or to create glossaries for travel guides. The Japanese-English Bilingual Kyoto Lexicon is also available. This lexicon was created by extracting the Japanese-English word pairs from this corpus. Sample One Wikipedia article is stored as one XML file in this corpus, and the corpus contains 14,111 files in total. The following is a short quotation from a corpus file titled “Ryoan-ji Temple”. Each tag has different implications. For example: `
Values of richness estimation (underlined) obtained using Chao and Jackknife algorithms on morphotypes (MT) and sequence variants (SV). The richness is estimated for each phylum and for the total sampled area. Actual richness identified with each method is shown for a direct comparison. Species were estimated using both the focal phyla (SV) and whole meiofauna dataset (SV2). “*” indicates phyla whose richness estimated using metabarcoding is lower than the number of actual morphotypes identified with morphological taxonomy.
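The caption does not say which Chao or Jackknife variants were used. As a hedged illustration only, the sketch below implements two common textbook forms, the abundance-based Chao1 estimator and the incidence-based first-order jackknife. The function names and the input layout are assumptions, and the authors' actual procedure may differ.

```python
from collections import Counter


def chao1(abundances):
    """Chao1 estimate from per-taxon counts (zeros are ignored)."""
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)                      # observed richness
    f1 = sum(1 for n in counts if n == 1)    # singletons
    f2 = sum(1 for n in counts if n == 2)    # doubletons
    if f2 == 0:
        # Bias-corrected form, used when there are no doubletons.
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * f1 / (2 * f2)


def jackknife1(samples):
    """First-order jackknife from a list of samples, each a set of taxa."""
    m = len(samples)
    occurrences = Counter(taxon for sample in samples for taxon in set(sample))
    s_obs = len(occurrences)
    # Taxa that occur in exactly one sample drive the correction term.
    q1 = sum(1 for k in occurrences.values() if k == 1)
    return s_obs + q1 * (m - 1) / m
```

Both estimators add a correction to the observed richness based on rare taxa, so an estimate can never fall below the richness observed with the same method.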
Context Dummy data to demo matplotlib Content 43 CSV rows of sales (qty and price) of 5 products in 3 regions by 11 reps Acknowledgements https://www.wintellect.com https://www.superdatascience.com/ Inspiration Thanks!
Dataset I used for the analyses of DNA barcoding gaps, species identification efficiency, sequence length and GC content
Resized and compressed CelebA dataset by http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
Project: Cloudbase: satellite-derived cloud base heights - The Cloudbase project aims to provide datasets of satellite-derived cloud base heights and their uncertainties. Project website: https://home.uni-leipzig.de/~jmuelmen/projects/ and https://home.uni-leipzig.de/~jmuelmen/projects/precerf.html This research was funded by the European Union under ERC Starting Grant QUAERERE, grant agreement 306284, and by the United States National Science Foundation under grant agreements AGS-1013423 and AGS-1048995. Summary: Attenuated backscatter profiles from the CALIOP satellite lidar are used to estimate cloud base heights of lower-troposphere liquid clouds (cloud base height below approximately 3 km). Even when clouds are thick enough to attenuate the lidar beam (optical thickness > 5), the technique provides cloud base heights by treating the cloud base height of nearby thinner clouds as representative of the surrounding cloud field. Using ground-based ceilometer data, uncertainty estimates for the cloud base height product at retrieval resolution are derived as a function of various properties of the CALIOP lidar profiles. Evaluation of the predicted cloud base heights and their predicted uncertainty using a second, statistically independent, ceilometer dataset shows that cloud base heights and uncertainties are biased by less than 10%. CBASE provides two files for each CALIOP VFM input file: one using a 40 km window to detect the cloud field base height, and one using a 100 km window. (The input CALIOP VFM dataset is organized by the daytime/nighttime half of each orbit.) The file name pattern is CBASE_T.nc (identical to the input CALIOP VFM file name with the exception of the product name). Files are organized into subdirectories by half-orbit start date.
The purpose of this Brainhack project was to create a simple application, with the least dependencies, for anonymization of DICOM files directly on a workstation. Anonymization of DICOM datasets is a requirement before an imaging study can be uploaded in a web-based database system, such as LORIS. Currently, a simple and efficient interface for the anonymization of such imaging datasets, which works on all operating systems and is very light in terms of dependencies, is not available. Here, we created a DICOM anonymizer that is a simple graphical tool that uses the PyDICOM package to anonymize DICOM datasets easily on any operating system, with no dependencies except for the default Python and NumPy packages. DICOM anonymizer is available for all UNIX systems (including Mac OS) and can be easily installed on Windows computers as well (see PyDICOM installation). The GUI (using tkinter) and the processing pipeline were designed in Python. Executing the anonymizer_gui.py script with a Python interpreter will start the program. Figure 1 illustrates how to use the program to anonymize a DICOM study. The DICOM anonymizer is a simple standalone graphical tool that facilitates anonymization of DICOM datasets on any operating system. These anonymized studies can be uploaded to a web-based database system, such as LORIS, without compromising the patient or participant’s identity.
population genetic dataset input file (119 individuals, 2,111 SNPs)
Copyright information: Taken from "From co-expression to co-regulation: how many microarray experiments do we need?" Genome Biology 2004;5(7):R48-R48. Published online 28 Jun 2004. PMCID: PMC463312. Copyright © 2004 Yeung et al.; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. We applied different clustering algorithms to cluster the genes in yeast microarray datasets with different sizes to identify co-expressed genes. The level of co-regulation is evaluated using yeast transcription factor databases (SCPD and YPD) and ChIP data. The clustering results are then evaluated by determining the fraction of gene pairs from the same clusters that share at least one known common transcription factor.
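The evaluation measure in the last sentence can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the authors' code: the function name and input dictionaries are hypothetical, and genes with no known transcription factor are treated as sharing none, which may not match how the original study handled unannotated genes.

```python
from itertools import combinations


def shared_tf_fraction(cluster_of, tfs_of):
    """Fraction of same-cluster gene pairs sharing >= 1 known transcription factor.

    cluster_of: gene -> cluster label; tfs_of: gene -> set of known TFs.
    Returns None if no two genes share a cluster.
    """
    same_cluster = shared = 0
    for g1, g2 in combinations(sorted(cluster_of), 2):
        if cluster_of[g1] != cluster_of[g2]:
            continue  # only pairs placed in the same cluster are scored
        same_cluster += 1
        # Assumption: a gene missing from tfs_of has no known regulators.
        if tfs_of.get(g1, set()) & tfs_of.get(g2, set()):
            shared += 1
    return shared / same_cluster if same_cluster else None


# Invented toy input: three genes in cluster 0, one gene in cluster 1.
clusters = {"g1": 0, "g2": 0, "g3": 0, "g4": 1}
known_tfs = {"g1": {"TF_A"}, "g2": {"TF_A", "TF_B"}, "g3": {"TF_C"}, "g4": {"TF_A"}}
fraction = shared_tf_fraction(clusters, known_tfs)
```

In the toy input, only one of the three same-cluster pairs shares a transcription factor, so the fraction is one third. The gene in cluster 1 does not count even though it shares a factor with two genes in cluster 0.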
Mean body sizes, approximated from the natural logarithm of the lower first or second molar area, and the proposed evolutionary relationships between mammalian genera from the middle and late Clarkforkian (Cf2 to Cf3) of the Bighorn and Clarks Fork Basins, Wyoming, USA. For details of the dataset see caption for electronic supplementary material, dataset S1.
The dataset includes fMRI raw data related to the paper entitled Lerner, Scherf, Katkov, Hasson and Behrmann (2018). Age-related differences in reliability of cortical activity under naturalistic viewing conditions.
Dataset used for the article Factors associated with Intrauterine Growth Restriction in Zimbabwean women: A Secondary Data Analysis
Project: WOCE-Argo Global Hydrographic Climatology - The WAGHC is a full-depth one-fourth degree resolution temperature and salinity climatology describing the mean state of the World Ocean between 1996 and 2011, with monthly gridded fields available between the surface and 1778 m depth. World Ocean Database 2013 (Locarnini et al., 2013) provided the majority of the temperature and salinity profiles, whereas data from the Alfred-Wegener-Institute, Bremerhaven, and from several Canadian Institutes helped to improve the data basis for the North Polar region considerably. A rigorous data quality control procedure was applied to the original profiles to exclude erroneous and highly untypical data. The spatial interpolation of the quality-controlled data was performed both on isobaric and isopycnal levels, so that essentially two climatologies are available. The isopycnally-averaged climatology mimics the process of isopycnal mixing in the real ocean and therefore is less prone to the production of artificial water masses. The WAGHC climatology represents the update of the WOCE Global Hydrographic Climatology (Gouretski and Koltermann, 2004). The name of the new climatology was chosen to highlight both the outstanding role of the WOCE hydrographic data for the historical global hydrographic archive and the importance of the more recent data from the Argo floats. Web-link: http://icdc.cen.uni-hamburg.de/1/daten/ocean/waghc/ References: Locarnini, R. A., A. V. Mishonov, J. I. Antonov, T. P. Boyer, H. E. Garcia, O. K. Baranova, M. M. Zweng, C. R. Paver, J. R. Reagan, D. R. Johnson, M. Hamilton, D. Seidov (2013) World Ocean Atlas 2013, Volume 1: Temperature. S. Levitus, Ed., A. Mishonov Technical Ed.; NOAA Atlas NESDIS 73, 40 pp. Gouretski, V., Koltermann, K.(2004) WOCE Global Hydrographic Climatology, Berichte des BSH, 35, 52pp., ISSN: 0946-6010. 
Funder: The work was conducted as part of the Excellence Initiative CLISAP at the Universität Hamburg, funded through the German Science Foundation (Grant EXC 177/2) Summary: The WOCE/ARGO Global Hydrographic Climatology (WAGHC) is conceived as the update of the previous WOCE Global Hydrographic Climatology (WGHC) (Gouretski and Koltermann, 2004). The following improvements have been made compared to the WGHC: 2) finer spatial resolution (0.25 degrees Lat/Lon compared to 0.5 degrees for WGHC); 3) finer vertical resolution (65 compared to 45 WGHC standard levels); 4) monthly temporal resolution compared to the all-data-mean WGHC parameters; 5) narrower overall time period; 6) calculation of the mean year corresponding to the optimally interpolated temperature and salinity values; 7) depth of the upper mixed layer. Similar to the WGHC, the optimal spatial interpolation is performed on the local isopycnal surfaces. This approach diminishes the production of artificial water masses. In addition to the isopycnally interpolated parameters, parameter values interpolated on the isobaric levels are also provided. The monthly gridded vertical profiles extend to the depth of 1898 m, below which only annual mean parameter values are available. Additionally, there is a dataset and a map available providing indexes for selected regions of the world ocean. Finally, a comparison with the last update of the NOAA World Ocean Atlas (Locarnini et al., 2013) was done.
This dataset contains: gene expression values, physiological parameters, Q_values, Minisatellites length, MHCIIb alleles, parasitological parameters.
Reconstructed slices of a 3D tomographic dataset after ring artifacts suppression using the wavelet-FFT-based method.
This dataset is supplementary to the article of Scherler et al. (submitted), in which the global distribution of supraglacial debris cover is mapped and analyzed. For mapping supraglacial debris cover, we combined glacier outlines from the Randolph Glacier Inventory (RGI) version 6.0 (RGI consortium, 2017) with remote sensing-based ice and snow identification. Areas that belong to glaciers but that are neither ice nor snow were classified as debris cover. This dataset contains the outlines of the mapped debris-covered glaciers areas, stored in shapefiles (.shp). For creating this dataset, we used optical satellite data from Landsat 8 (for the time period 2013-2017), and from Sentinel-2A/B (2015-2017). For the ice and snow identification, we used three different algorithms: a red to short-wavelength infrared (swir) band ratio (RATIO; Hall et al., 1988), the normalized difference snow index (NDSI; Dozier, 1989), and linear spectral unmixing-derived fractional debris cover (FDC; e.g., Keshava and Mustard, 2002). For a detailed description of the debris-cover mapping and an analysis of the data, please see Scherler et al. (2019) to which these data are supplementary material. This dataset includes debris cover outlines based on either Landsat 8 (LS8; 30-m resolution) or Sentinel 2 (S2; 10-m resolution), and the three algorithms RATIO, NDSI, FDC. In total, there exist six different zip-files that each contain 19 shapefiles. The structure of the shapefiles follows that of the RGI version 6.0 (RGI consortium, 2017), with one shapefile for each RGI region. The original RGI shapefiles provide each glacier as one entry (feature) and include a variety of ancillary information, such as area, slope, aspect (RGI Consortium 2017a, Technical Note p. 12ff). 
Because the debris-cover outlines are based on the RGI v6.0 glacier outlines, all fields of the original shapefiles, which refer to the glacier, are retained, and expanded with four new fields: - DC_Area: Debris-covered area in m². Note that this unit for area is different from the unit used for reporting the glacier area (km²). - DC_BgnDate: Start of the time period from which satellite imagery was used to map debris cover. - DC_EndDate: End of the time period from which satellite imagery was used to map debris cover. - DC_CTSmean: Mean number of observations (CTS = COUNTS) per pixel and glacier. This number is derived from the number of available satellite images for the respective time period, reduced by filtering pixels due to cloud and snow cover. The dataset has a global extent and covers all of the glaciers in the RGI v. 6.0, but it exhibits poor coverage in the RGI region Subantarctic and Antarctic, where the debris cover extents are based on very few observations.
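The mapping logic described above (a pixel inside a glacier outline that is neither ice nor snow counts as debris) can be sketched for the NDSI variant as follows. NDSI is the standard normalized difference of green and short-wavelength infrared reflectance; everything else here is an assumption for illustration, in particular the 0.4 threshold, the per-pixel tuple layout, and the function names, none of which are taken from the dataset.

```python
def ndsi(green, swir):
    """Normalized difference snow index for one pixel's reflectances."""
    total = green + swir
    return (green - swir) / total if total else 0.0


def is_debris(green, swir, on_glacier, ndsi_threshold=0.4):
    """Classify a pixel as debris if it lies on a glacier but is not ice or snow.

    The 0.4 threshold is a hypothetical placeholder, not a value from the dataset.
    """
    return on_glacier and ndsi(green, swir) < ndsi_threshold


def debris_area_m2(pixels, pixel_size_m):
    """Sum the area of debris pixels; pixels are (green, swir, on_glacier) tuples."""
    n_debris = sum(1 for g, s, inside in pixels if is_debris(g, s, inside))
    return n_debris * pixel_size_m ** 2


# Invented pixels: bright snow on a glacier, dark debris on a glacier, dark rock off it.
pixels = [(0.8, 0.1, True), (0.2, 0.2, True), (0.2, 0.2, False)]
area = debris_area_m2(pixels, pixel_size_m=30)  # 30 m matches Landsat 8 resolution
```

The result is expressed in square metres, the same unit as the DC_Area field, so it would need dividing by 1e6 before comparison with glacier areas reported in km².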
Datasets obtained from the simulator of gametic phase disequilibrium between two loci + R script for producing the figures presented in the paper
This presentation describes our current research for a lay audience. It describes the National Health and Nutrition Examination Survey (NHANES) and our use of this publicly available dataset for the automatic discovery of associations between study abstracts and variables in the NHANES. This approach can be generalized to other scientific domains to gain insight into published literature.
Inspiration What did we all actually upload to Kaggle? And how did the community respond? We can find out by looking at this dataset of datasets. Content This dataset is in CSV format, where each column is a feature or attribute of a dataset on Kaggle (e.g. tags, filetype, no. of Kernels, etc.) and each row is a dataset on Kaggle. Acknowledgements Thanks to Kaggle for the super easy API endpoint design!
Context The vast majority of food and food ingredients eaten today are processed in some way before they arrive at the kitchen or dinner table. Food processing equipment may leave trace amounts of various industrial chemical compounds in the foods we eat, and these chemicals, classed as **indirect food additives**, are regulated by the United States Food and Drug Administration. This dataset is a list of indirect food additives approved by the FDA. Content This dataset contains the names of chemical compounds and references to the federal government regulatory code approving and controlling their usage. Acknowledgements This dataset is published by the FDA and available [online](https://www.accessdata.fda.gov/scripts/fdcc/?set=IndirectAdditives) as a for-Excel `CSV` file. A few errant header columns have been cleaned up prior to upload to Kaggle, but otherwise the dataset is published as-is. Inspiration * What tokens most commonly appear amongst the names contained in this list? * Any identifiable elements or compounds?
Context Venue names and geo-coordinates of venues in New York City Content Venue names, latitude and longitude of venues in New York City Acknowledgements The venue names in New York City are fetched from : https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/locrec/gowalla-dataset.zip The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009.
Raw data for the HR 114 pollen surface sample dataset obtained from the Neotoma Paleoecological Database.
Tillage is a central element in agricultural soil management and has direct and indirect effects on processes in the biosphere. Effects of agricultural soil management can be assessed by soil, crop, and ecosystem models but global assessments are hampered by lack of information on type and spatial distribution. This dataset is the result of a study on global classification of tillage practices and the spatially explicit mapping of crop-specific tillage systems for around the year 2005. This global gridded tillage system data set is dedicated to modeling communities interested in the quantitative assessment of biophysical and biogeochemical impacts of land use and soil management on cropland. The data set is complemented by the publication of the R code and can be used for reproducing the results and building upon them for scenarios including the expansion of sustainable soil management practices such as Conservation Agriculture (Porwollik et al. 2018, http://doi.org/10.5880/PIK.2018.013). Both the data set and the R code are described in detail in Porwollik et al. (2018, ESSD). We present the mapping result of six tillage systems for 42 crop types and potentially suitable Conservation Agriculture area as variables: 1 = conventional annual tillage; 2 = traditional annual tillage; 3 = reduced tillage; 4 = Conservation Agriculture; 5 = rotational tillage; 6 = traditional rotational tillage; 7 = potentially suitable Conservation Agriculture area. Reference system: WGS84. Geographic extent: Longitude (min, max) (-180, 180), Latitude (min, max) (-56, 84). Resolution: 5 arc-minutes. Time period covered: around the year 2005. Type: NetCDF.
Dataset sources (with indication of reference): 1. Grid cell allocation key to country: IFPRI/IIASA (2017, cell5m_allockey_xy.dbf.zip); 2. Crop-specific physical cropland: IFPRI/IIASA (2017, spam2005v3r1_global_phys_area.geotiff.zip); 3. SoilGrids depth to bedrock: Hengl et al. (2014); 4. Aridity index: FAO (2015); 5. Conservation Agriculture area: FAO (2016); 6. Income level: World Bank (2017); 7. Field size: Fritz et al. (2015); 8. Water erosion: Nachtergaele et al. (2011).
The dataset contains the number of air passengers for each month from 1949 to 1960. We can use this data to forecast future values and help the business.
Movie Data Set This is a movie data set consisting of 3886 films scraped from [Hydra Movies full collection of movies.][1] Although there is a data dump available via their API - the data they release does not include cast, writers, directors or the short summary. Content For each of the 3886 movies you will find the following data: - Movie Title - Release Year - Summary Long - Summary Short - IMDB ID - Runtime - YouTube Trailer Code - IMDB Rating - Movie Poster (URL path) - Directors - Writers - Cast Inspiration This is a more complete data set than the public data dump via the [Hydra Movies API.][2] Hopefully you will find it more useful. [1]: https://hydramovies.com/ [2]: https://hydramovies.com/api/
Fruits 360 dataset: A dataset of images containing fruits Version: 2018.09.07.0 Content The following fruits are included: Apples (different varieties: Golden, Golden-Red, Granny Smith, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red), Cactus fruit, Cantaloupe (2 varieties), Carambula, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Clementine, Cocos, Dates, Granadilla, Grape (Pink, White, White2), Grapefruit (Pink, White), Guava, Huckleberry, Kiwi, Kaki, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine, Orange, Papaya, Passion fruit, Peach, Pepino, Pear (different varieties, Abate, Monster, Williams), Physalis (normal, with Husk), Pineapple (normal, Mini), Pitahaya Red, Plum, Pomegranate, Quince, Rambutan, Raspberry, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red), Walnut. Dataset properties Total number of images: 55244. Training set size: 41322 images (one fruit per image). Test set size: 13877 images (one fruit per image). Multi-fruits set size: 45 images (more than one fruit (or fruit class) per image) Number of classes: 81 (fruits). Image size: 100x100 pixels. Filename format: image_index_100.jpg (e.g. 32_100.jpg) or r_image_index_100.jpg (e.g. r_32_100.jpg) or r2_image_index_100.jpg. "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels). Different varieties of the same fruit (apple for instance) are stored as belonging to different classes. How we made it Fruits were planted in the shaft of a low speed motor (3 rpm) and a short movie of 20 seconds was recorded. A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available. Behind the fruits we placed a white sheet of paper as background. 
However, due to the variations in the lighting conditions, the background was not uniform and we wrote a dedicated algorithm that extracts the fruit from the background. This algorithm is of flood fill type: we start from each edge of the image and we mark all pixels there, then we mark all pixels found in the neighborhood of the already marked pixels for which the distance between colors is less than a prescribed value. We repeat the previous step until no more pixels can be marked. All marked pixels are considered as being background (which is then filled with white) and the rest of the pixels are considered as belonging to the object. The maximum value for the distance between 2 neighbor pixels is a parameter of the algorithm and is set (by trial and error) for each movie. Published research papers Horea Muresan, [Mihai Oltean](https://mihaioltean.github.io), [Fruit recognition from images using deep learning](https://www.researchgate.net/publication/321475443_Fruit_recognition_from_images_using_deep_learning), Acta Univ. Sapientiae, Informatica Vol. 10, Issue 1, pp. 26-42, 2018. The paper introduces the dataset and an implementation of a Neural Network trained to recognize the fruits in the dataset. Alternate download This dataset is also available for download from GitHub: [Fruits-360 dataset](https://github.com/Horea94/Fruit-Images-Dataset) History Fruits were filmed at the dates given below (YYYY.MM.DD): 2017.02.25 - Apple (golden). 2017.02.28 - Apple (red-yellow, red, golden2), Kiwi, Pear, Grapefruit, Lemon, Orange, Strawberry, Banana. 2017.03.05 - Apple (golden3, Braeburn, Granny Smith, red2). 2017.03.07 - Apple (red3). 2017.05.10 - Plum, Peach, Peach flat, Apricot, Nectarine, Pomegranate. 2017.05.27 - Avocado, Papaya, Grape, Cherry. 2017.12.25 - Carambula, Cactus fruit, Granadilla, Kaki, Kumsquats, Passion fruit, Avocado ripe, Quince. 2017.12.28 - Clementine, Cocos, Mango, Lime, Lychee. 2017.12.31 - Apple Red Delicious, Pear Monster, Grape White.
2018.01.14 - Ananas, Grapefruit Pink, Mandarine, Pineapple, Tangelo. 2018.01.19 - Huckleberry, Raspberry. 2018.01.26 - Dates, Maracuja, Salak, Tamarillo. 2018.02.05 - Guava, Grape White 2, Lemon Meyer 2018.02.07 - Banana Red, Pepino, Pitahaya Red. 2018.02.08 - Pear Abate, Pear Williams. 2018.05.22 - Lemon rotated, Pomegranate rotated. 2018.05.24 - Cherry Rainier, Cherry 2, Strawberry Wedge. 2018.05.26 - Cantaloupe (2 varieties). 2018.05.31 - Melon Piel de Sapo. 2018.06.05 - Pineapple Mini, Physalis, Physalis with Husk, Rambutan. 2018.06.08 - Mulberry. 2018.06.16 - Walnut, Tomato Cherry Red. 2018.06.17 - Cherry Wax (Yellow, Red, Black). 2018.08.19 - Tomato Maroon, Tomato 1-4. License MIT License Copyright (c) 2017-2018 [Mihai Oltean](https://mihaioltean.github.io), Horea Muresan Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
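The flood-fill background removal described in the "How we made it" part of the Fruits 360 entry can be sketched as follows. This is a minimal illustration and not the authors' implementation: the Euclidean RGB distance, the 4-pixel neighborhood, and the function names are assumptions, since the description does not specify the color metric or the neighborhood.

```python
from collections import deque


def color_dist(c1, c2):
    """Euclidean distance between two (r, g, b) colors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5


def remove_background(image, max_dist):
    """Flood-fill from the image border; return a copy with background set to white.

    image: list of rows of (r, g, b) tuples. max_dist: largest color distance
    between two neighboring pixels across which the fill may spread.
    """
    h, w = len(image), len(image[0])
    marked = [[False] * w for _ in range(h)]
    queue = deque()
    # Step 1: mark every pixel on the four edges of the image.
    for y in range(h):
        for x in range(w):
            if y in (0, h - 1) or x in (0, w - 1):
                marked[y][x] = True
                queue.append((y, x))
    # Step 2: spread to unmarked 4-neighbors whose color is close enough to an
    # already marked pixel, until no more pixels can be marked.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not marked[ny][nx]:
                if color_dist(image[y][x], image[ny][nx]) < max_dist:
                    marked[ny][nx] = True
                    queue.append((ny, nx))
    # Step 3: marked pixels are background and become white; the rest is the object.
    white = (255, 255, 255)
    return [[white if marked[y][x] else image[y][x] for x in range(w)]
            for y in range(h)]
```

Because the comparison is made between neighboring pixels rather than against a fixed background color, the fill can follow gradual lighting changes across the sheet of paper, which is presumably why the threshold had to be tuned per movie.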
This dataset contains the stimuli, response data, and SPSS and Excel data files belonging to the paper entitled "Visualizing uncertainty in individual growth predictions in population charts", which has been submitted for publication in PeerJ.
This archive contains data files for the sediment properties (distribution and grain size) and bedrock geology for the areas covered by North American ice sheets (including Greenland and Iceland). These datasets are distributed as shapefiles and NetCDF files. These files are intended for use in ice sheet models.
The datasets were obtained by Tilman et al. (2001: Tilman, D., Knops, J., Wedin, D., Reich, P., Ritchie, M. & Siemann, E. Diversity and productivity in a long-term grassland experiment. Science, 294, 843-846) and Langenheder et al. (2010: Langenheder, S., Bulling, M. T., Solan, M. & Prosser, J. I. Bacterial Biodiversity-Ecosystem functioning Relations Are Modified by Environmental Complexity. PLoS ONE, 5, e10834. doi:10.1371/journal.pone.0010834). The R-files were written by Camille Richon and Benoît Jaillard.
Details of the road network in the Scottish Borders area of the UK, formatted using Resource Description Framework (RDF) stored in a Jena TDB dataset (see http://jena.apache.org/documentation/tdb/). Data format is based on the OpenStreetMap (http://www.openstreetmap.org/map=5/51.500/-0.100) representation of ways as a series of nodes.
Context This is a NumPy array of the open-source Indian Pines dataset together with its ground-truth NumPy array. The data is small in size (145x145x220) and is a good introduction to hyperspectral remote sensing.
**Connect/Follow me on [LinkedIn](http://link.rajanand.org/linkedin) for more updates on interesting dataset like this. Thanks.** Context This data set contains yearly suicide detail of all the states/u.t of India by various parameters from 2001 to 2012. Content Time Period: 2001 - 2012 Granularity: Yearly Location: States and U.T's of India Parameters: a) Suicide causes b) Education status c) By means adopted d) Professional profile e) Social status Acknowledgements National Crime Records Bureau (NCRB), Govt of India has shared this [dataset](https://data.gov.in/dataset-group-name/accidental-deaths-and-suicides) under [Govt. Open Data License - India](https://data.gov.in/government-open-data-license-india). NCRB has also shared the historical data on their [website](http://ncrb.nic.in/StatPublications/ADSI/PrevPublications.htm)
Context This is a dataset put together to allow data scientists to put their skills to the test against the efficiency of the horse racing betting market. It will be a great challenge for all data scientists to find out whether they are able to create a model that outperforms the market prices. Content The data includes results for all races, starting prices and 101 explanatory variables for each runner. Inspiration Is it possible to beat the horse racing betting market? Where is the market more efficient or less efficient? Is it possible to use the market price in conjunction with other variables to come up with a more accurate prediction?
Context The Museum of Modern Art (MoMA) acquired its first artworks in 1929, the year it was established. Today, the Museum’s evolving collection contains almost 200,000 works from around the world spanning the last 150 years. The collection includes an ever-expanding range of visual expression, including painting, sculpture, printmaking, drawing, photography, architecture, design, film, and media and performance art. Content MoMA is committed to helping everyone understand, enjoy, and use our collection. The Museum’s website features 72,706 artworks from 20,956 artists. The artworks dataset contains 130,262 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in our database. It includes basic metadata for each work, including title, artist, date, medium, dimensions, and date acquired by the Museum. Some of these records have incomplete information and are noted as “not curator approved.” The artists dataset contains 15,091 records, representing all the artists who have work in MoMA's collection and have been cataloged in our database. It includes basic metadata for each artist, including name, nationality, gender, birth year, and death year. Inspiration Which artist has the most works in the museum collection or on display? What is the largest work of art in the collection? How many pieces in the collection were made during your birth year? What gift or donation is responsible for the most artwork in the collection?
Snow depths and bulk densities of the annual snow layer were measured at 69 different locations on glaciers across Nordenskiöldland, Svalbard, during the spring seasons of the period 2014–2016. Sampling locations lie along nine transects extending over 17 individual glaciers. Several of the locations were visited repeatedly, leading to a total of 109 point measurements, on which we report in this study. Snow water equivalents were calculated for each point measurement. In the dataset, snow depth and density measurements are accompanied by appropriate uncertainties which are rigorously transferred to the calculated snow water equivalents using a straightforward Monte Carlo simulation-style procedure. The final dataset can be downloaded from the Pangaea data repository (https://www.pangaea.de; https://doi.org/10.1594/PANGAEA.896581). Snow cover data indicate a general and statistically significant increase of snow depths and water equivalents with terrain elevation. A significant increase of both quantities with decreasing distance towards the east coast of Nordenskiöldland is also evident, but shows distinct interannual variability. Snow density does not show any characteristic spatial pattern.
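The Monte Carlo style propagation of uncertainties mentioned above can be sketched roughly as follows. This is an illustration only, not the authors' procedure: it assumes independent Gaussian errors on depth and density, computes snow water equivalent as depth times bulk density, and uses made-up input values. The function name and parameters are hypothetical.

```python
import random
import statistics


def swe_monte_carlo(depth_m, depth_sd, density_kg_m3, density_sd, n=50_000, seed=0):
    """Propagate depth and density uncertainties to snow water equivalent.

    Returns the mean and standard deviation of SWE in kg m^-2, which is
    numerically equal to mm water equivalent. Assumes independent,
    normally distributed measurement errors.
    """
    rng = random.Random(seed)
    samples = [
        rng.gauss(depth_m, depth_sd) * rng.gauss(density_kg_m3, density_sd)
        for _ in range(n)
    ]
    return statistics.fmean(samples), statistics.stdev(samples)


# Made-up example point: 1.20 +/- 0.05 m of snow at 380 +/- 20 kg m^-3.
mean_swe, sd_swe = swe_monte_carlo(1.20, 0.05, 380.0, 20.0)
print(round(mean_swe, 1), round(sd_swe, 1))
```

For a simple product like this one, the sampled spread should sit close to what first-order error propagation gives; the sampling approach mainly pays off when the error distributions are not Gaussian.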
FASTA files containing the concatenated multigene alignments of the datasets listed in Supplemental Table S4.
Content of Excel files:
-- BASE calculation.xlsx: This is the core dataset for "Assessing the Efficiency of Land Use Changes for Mitigating Climate Change" using a 4% discount rate for all calculations.
-- Sensitivity variant DISC 2%.xlsx, Sensitivity variant DISC 6%.xlsx: Sensitivity calculations using 2% and 6% discount rates.
-- Sensitivity variant GAIN.xlsx: Calculations using the carbon gain method.
-- Sensitivity variant HIGH.xlsx, Sensitivity variant LOW.xlsx: Calculations using +/- 20% variations for native vegetation estimates and +/- 30% for soil carbon estimates under native vegetation. The HIGH variant uses 20% and 30% higher values for vegetation and soil carbon stocks in native vegetation, respectively, than the central scenario. The LOW variant uses 20% and 30% lower values for vegetation and soil carbon stocks in native vegetation, respectively, than the central scenario.
Description of raster files:
-- lpjml_anpp_avg_2001-2010.asc: Annual net primary productivity of potential native vegetation under current climate simulated with the LPJmL model.
-- lpjml_vegc_avg_cor_2001-2010.asc: Above- and below-ground carbon stocks of potential natural vegetation under current climate simulated with the LPJmL model and adjusted at the biome level according to reference values from the literature (see Supplementary Information).
-- lpjml_soilc_1m_avg_cor_2001-2010.asc: Soil carbon stocks of potential natural vegetation under current climate simulated with LPJmL and adjusted at the biome level according to reference values from the literature (see Supplementary Information).
The dataset contains the unigenes from the longest contigs per transcript generated by Trinity. The fb.flower bud.Unigene.fa file contains unigenes from flower of P. equestris, the L5.root.Unigene.fa file contains unigenes from root, the L6.stem.Unigene.fa file contains unigenes from stem, and the PHA.leaf.Unigene.fa file contains unigenes from leaf. The 12_day.unigene.fasta, 7_day.unigene.fasta and 4_day.unigene.fasta files contain unigenes from seeds sampled 12, 7 and 4 days, respectively, after sowing on 1/2 MS medium. The sepal.unigene.fasta, petal.unigene.fasta, lip.unigene.fasta and column.unigene.fasta files contain unigenes from sepal, petal, lip and column, respectively.
This dataset includes journal specific information aggregated from the mydata dataset including the journal specific delta values for all 114 journals in consideration.
Illustration of the concept that transcription expression profiles (non-normalized) of regulator YML027W (YOX1, red line) and regulator YMR016C (SOK2, blue line) are dynamically combined. This demonstrates a significant match between the combinatorial expression profile and the expression of the target gene YOR039W (CKB2) in the studied dataset. The conversion efficiency, which indicates the ratio between the number of functional activated binding regulators and the number of available transcription factor transcripts, is presented as a percentage (10% and 70% here). Copyright information: Taken from "Dynamic cumulative activity of transcription factors as a mechanism of quantitative gene regulation", http://genomebiology.com/2007/8/9/R181, Genome Biology 2007;8(9):R181-R181. Published online 4 Sep 2007. PMCID: PMC2375019.
endometrial cancer dataset from TCGA
Supplementary Dataset 1. Sequence alignment used for phylogenetic analysis of C3 complement sequences. (FASTA 47 kb)
Context This corpus contains 5001 female names and 2943 male names, sorted alphabetically, one per line created by Mark Kantrowitz and redistributed in NLTK. The `names.zip` file includes - README: The readme file. - female.txt: A line-delimited list of words. - male.txt: A line-delimited list of words. License/Usage Names Corpus, Version 1.3 (1994-03-29) Copyright (C) 1991 Mark Kantrowitz Additions by Bill Ross This corpus contains 5001 female names and 2943 male names, sorted alphabetically, one per line. You may use the lists of names for any purpose, so long as credit is given in any published work. You may also redistribute the list if you provide the recipients with a copy of this README file. The lists are not in the public domain (I retain the copyright on the lists) but are freely redistributable. If you have any additions to the lists of names, I would appreciate receiving them. Mark Kantrowitz
This dataset contains key characteristics about the data described in the Data Descriptor CommonMind Consortium provides transcriptomic and epigenomic data for Schizophrenia and Bipolar Disorder. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format
Copyright information: Taken from "Osprey: a network visualization system", Genome Biology 2003;4(3):R22-R22. Published online 27 Feb 2003. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC153462. Copyright © 2003 Breitkreutz et al.; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. Network containing 2,245 vertices and 6,426 edges from combined datasets of Gavin [10], shown in red, and Ho [11], shown in yellow. A source filter reveals only those interactions shared by both datasets, namely 212 vertices and 188 edges.
Comprehensive hydrometeorological dataset collected at experimental farms on the University of Melbourne's Dookie Campus.
This dataset contains summary results data from the 2013 meta-analysis of genome-wide association data in Alzheimer's disease produced by the International Genomics of Alzheimer's Project (IGAP). The data set corresponds to the meta-analysis results of the 11,632 SNPs that were genotyped and tested for association in an independent set of 8,572 Alzheimer's disease cases and 11,312 controls, with the combined stage1/stage2 P-values. Details are described in the publication https://www.ncbi.nlm.nih.gov/pubmed/?term=24162737.
List of the genes composing the merged matrix of all transcriptomic data from datasets and PCD samples (n=9939).
Code

    # Imports
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Data import
    train = pd.read_csv("titanic_data/train.csv", dtype={"Age": np.float64})
    test = pd.read_csv("titanic_data/test.csv", dtype={"Age": np.float64})

    # Convert the male and female groups to integer form
    train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
    test["Sex"] = test["Sex"].map({"male": 0, "female": 1})

    # Impute the Age and Fare variables with their medians
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    test["Age"] = test["Age"].fillna(test["Age"].median())
    test["Fare"] = test["Fare"].fillna(test["Fare"].median())

    # We want the Pclass, Age, Sex, Fare, SibSp and Parch variables
    feature_names = ["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch"]
    features_forest = train[feature_names].values
    target = train["Survived"].values
    print(features_forest[0])

    # Building and fitting my_forest
    forest = RandomForestClassifier(max_depth=25, min_samples_split=10, min_samples_leaf=10, n_estimators=1000, random_state=1)
    my_forest = forest.fit(features_forest, target)

    # Print the score of the fitted random forest (accuracy on the training data)
    print(my_forest.score(features_forest, target))

    # Compute predictions on our test set features, then print the length of the prediction vector
    test_features = test[feature_names].values
    pred_forest = my_forest.predict(test_features)
    PassengerId = np.array(test["PassengerId"]).astype(int)
    my_solution = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred_forest})
    pd.set_option("display.max_rows", 500)
    my_solution.to_csv("titanic_random-forest.csv", index=False)
    print(len(pred_forest))
Context Violent Crime Rates by US State Content This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas. Acknowledgements World Almanac and Book of Facts 1975. (Crime rates). Statistical Abstracts of the United States 1975. (Urban rates). References McNeil, D. R. (1977) Interactive Data Analysis. New York: Wiley.
This dataset is a subset of the Yelp Challenge dataset; it contains all the reviews from 2013.
GDSC: Genomics of Drug Sensitivity in Cancer
CCLE: Cancer Cell Line Encyclopedia
gCSI: Genentech Cell Screening Initiative
GRAY: A pharmacogenomic dataset of 70 breast cancer cell lines from the Joe Gray lab
UHN: A pharmacogenomic dataset of 84 breast cancer cell lines from the University Health Network
Dataset of material properties used to predict dielectric constants. Note on citations: If you found this dataset useful and would like to cite it in your work, please be sure to cite its original sources below rather than or in addition to this page. Dataset described in the following publication: Petousis I, Mrdjenovich D, Ballouz E, Liu M, Winston D, Chen W, Graf T, Schladt TD, Persson KA, Prinz FB (2017) High-throughput screening of inorganic compounds for the discovery of novel dielectric and optical materials. Scientific Data 4: 160134. https://doi.org/10.1038/sdata.2016.134 Dataset was adapted by Hacking Materials group from JSON files originally sourced from Dryad (see references 3-4 below). Petousis I, Mrdjenovich D, Ballouz E, Liu M, Chen W, Graf T, Schladt TD, Persson KA, Prinz FB (2017) Data from: High-throughput screening of inorganic compounds for dielectric and optical properties to enable the discovery of novel materials. Dryad Digital Repository. https://doi.org/10.5061/dryad.ph81h
The data relate to the PhD thesis submitted to Cardiff University in candidature for the degree of Doctor of Philosophy by Muditha Abeysekera. The thesis presents research undertaken to develop a model for the combined steady state simulation and operation planning of integrated energy supply systems. As part of the thesis, three key components of the model were developed, i.e.: 1) Optimal power dispatch of an integrated energy system: A real case study was used to demonstrate the economic benefits of considering the interactions between different energy systems in their design and operation planning. This work is presented in Chapter 3 of the PhD thesis. The data related to this work are available in the XLS file titled 'Chapter 3_Optimal power dispatch_Dataset'. These data provide, for the period 1/4/2014 - 31/3/2015, half-hourly figures for electricity demand (kW) and from two centres, the heat demand (kW), and optimal gas input to gas boiler (kW), optimal gas input to CHP unit (kW), optimal electricity input (kW), optimal electric chiller electricity input (kW), optimal heat input to absorption chiller (kW), marginal cost of electricity (£), marginal cost of heat (£), marginal cost of cooling (£) in terms of electricity demand (kW) and heat demand (kW) in 500kW bins. 2) Simultaneous steady state analysis of coupled energy networks: An example of a coupled electricity, gas, district heating and district cooling network system was used to illustrate the formulation of equations and the iterative solution method. A case study was carried out to demonstrate the application of the method for integrated energy network analysis. This work is presented in Chapter 5 of the PhD thesis. The data related to this work are available in the XLS file titled 'Chapter 5_case study_Data set'.
The data provide, for 3 cases: electricity network results - for bus bar, energy demands (MW), power generation (MW) and voltage magnitude and voltage angle, and branch results comprising active power and reactive power to and from bus injection, and real and reactive power losses; gas network information - for gas node, gas demand, CHP gas demand, boiler gas demand, total gas supply and gas pressure, and pipe data for gas flow rate and pressure drop; heat network information - for heat node, fixed heat demand, absorption chiller heat demand, total heat demand, heat supply, mass flow into node, supply-line and return-line temperatures, supply-line and return-line pressures, and pumping power, and branch information of mass flow rate, supply-line and return-line heat loss, line heat loss, supply-line and return-line temperature loss, line heat loss, supply-line and return-line temperature drop, branch pressure drop and branch specific pressure drop; cooling network information - for cooling node, fixed cooling demand, cooling supply, mass flow into node, supply-line and return-line temperature, supply-line and return-line pressure, pumping power, and for branch, mass flow rate, supply-line heat gain, return-line heat loss and branch pressure drop. 3) Steady state analysis of gas networks with the distributed injection of alternative gases: A case study was carried out to demonstrate the impact of alternative gas injections on the pressure delivery and gas quality in the network. This work is presented in Chapter 6 of the PhD thesis. The data related to this work are available in the XLS file titled 'Chapter 5_case study_Data set'.
The data provide: node pressure (mbar) and branch flow rate (m^3/hr); pressure (mbar) for the nodes under hydrogen-enriched natural gas mixture and for upgraded biogas mixture; for the nodes, the actual energy demand (kJ/s) and available energy; pressure (mbar) and Wobbe index at nodes for two methods, and flow rate in branches for the two methods, over various periods.
Manual stance and veracity judgments over a large set of English social media conversations. Judgments are expertsourced and crowdsourced with extensive quality control as detailed in the referenced paper. Conversations are around a central claim; claims are grouped into news themes, each theme centering on a current event. Social media platforms include Twitter and Reddit. This is the dataset for the SemEval-2019 task, RumourEval.
Copyright information: Taken from "A new computational approach to analyze human protein complexes and predict novel protein interactions", http://genomebiology.com/2007/8/12/R256, Genome Biology 2007;8(12):R256-R256. Published online 4 Dec 2007. PMCID: PMC2246258. The number of complexes with a best value equal to or lower than the corresponding one on the x-axis is plotted for three non-synchronized and stressed HeLa datasets at a fixed FDR: dithiothreitol (DTT); heat shock; tunicamycin.
Context When Twitter introduced its thread functionality, a debate emerged: "If you're gonna write a f*ck ton of tweets at once, why not write a blog post instead of cluttering my feed?"... "It's easier and user-friendlier to share ideas in a single app"... I'm not getting into that debate. Both blog posts and Twitter threads have their own advantages. But I noticed a phenomenon while reading threads on Twitter: **the engagement (*retweets, likes and replies*) drops with each subsequent tweet!** Now, this has some logical explanations. Like, people don't want to retweet or like *every* tweet in a thread, because that'd be annoying. But this trend kept appearing in every single thread I read. It was bugging me, so I had to gather some data.
Content The dataset is divided into **five** parts:
- `five_ten.csv`: data of threads 5-10 tweets long
- `ten_fifteen.csv`: data of threads 10-15 tweets long
- `fifteen_twenty.csv`: data of threads 15-20 tweets long
- `twenty_twentyfive.csv`: data of threads 20-25 tweets long
- `twentyfive_thirty.csv`: data of threads 25-30 tweets long
They all contain the same data:
- `id`: Tweet ID (maybe I should remove it to anonymize the data?)
- `thread_number`: Thread identifier, used for grouping each thread and its tweets
- `timestamp`: Creation date of each tweet
- `text`: The content of each tweet
- `retweets`: Retweet count for each tweet
- `likes`: Like count for each tweet
- `replies`: Reply count for each tweet
Each "bin" contains around 100 threads... so in total there are ~500 threads.
Acknowledgements The threads were manually gathered using [Thread Reader][1] (both the web page and the [bot][2]).
Disclaimer The content of the threads/tweets **did not** have any influence on whether a thread was chosen. The only parameter was the length of the thread (5-30 tweets tops). The tweets collected date from October 2017 to May 2018.
Inspiration One thing I noticed while gathering the data was that political threads have a steadier engagement than, say, art threads. So **context might influence thread engagement**, and it'd be interesting to do some NLP to figure that out. Also it'd be cool to find a "formula" for better engagement in Twitter threads, like how long should a thread be? Or maybe a probability of engagement based on the success of the initial tweet? Finally, this whole issue reminds me of [the headline problem][3]: most people don't go beyond the headline. Maybe Twitter threads suffer from that too.
[1]: http://threadreaderapp.com/
[2]: https://twitter.com/threadreaderapp
[3]: https://www.washingtonpost.com/news/the-fix/wp/2014/03/19/americans-read-headlines-and-not-much-else/
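One rough way to check the engagement drop with the columns listed above is to rank tweets within each thread and average the counts per position. The sketch below uses a tiny made-up table in place of the CSV files and assumes that `timestamp` orders the tweets within a thread; the function name is hypothetical.

```python
import pandas as pd


def engagement_by_position(tweets):
    """Average retweets, likes and replies by a tweet's position in its thread."""
    ordered = tweets.sort_values(["thread_number", "timestamp"]).copy()
    # Position 1 is the opening tweet of each thread.
    ordered["position"] = ordered.groupby("thread_number").cumcount() + 1
    return ordered.groupby("position")[["retweets", "likes", "replies"]].mean()


# Made-up toy data standing in for one of the CSV files.
toy = pd.DataFrame({
    "thread_number": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2018-01-01 10:00", "2018-01-01 10:01", "2018-01-01 10:02",
        "2018-02-01 09:00", "2018-02-01 09:01", "2018-02-01 09:02",
    ]),
    "retweets": [100, 40, 20, 50, 30, 10],
    "likes": [300, 120, 60, 150, 90, 30],
    "replies": [20, 5, 2, 10, 3, 1],
})

profile = engagement_by_position(toy)
print(profile)
```

With the real files, one would read each CSV with `pd.read_csv`, parse `timestamp` as a date, and apply the same function per length bin.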
Context This dataset is an exported version of the [Atlanta Crime Data Report](http://www.atlantapd.org/i-want-to/crime-data-downloads), a dataset on crimes in the city of Atlanta, Georgia, published by the city's police department. Content This is crime data from the City of Atlanta. This area contains weekly crime reports that commanders use to best deploy Atlanta officers to combat crime. It also contains a raw crime data dump that is updated weekly. Crime data in this area is counted by incident in the area. Acknowledgements The original source for this dataset is located [on the Atlanta PD website](http://www.atlantapd.org/i-want-to/crime-data-downloads). Inspiration What can you learn about crime in Atlanta from this dataset? How does it compare to crimes committed in other cities with data on Kaggle, like [New York City](https://www.kaggle.com/adamschroeder/crimes-new-york-city)?
Context This dataset was downloaded from INEP, a department of the Brazilian Education Ministry. It contains data on the applicants for the 2015 National High School Exam. Content This dataset contains not only the exam results but also the social and economic context of the applicants. Acknowledgements The original dataset is provided by INEP (http://portal.inep.gov.br/microdados). I removed some information from the original files to fit the file size within the Kaggle constraints. Inspiration The objective is to explore the dataset to achieve a better understanding of how the social and economic context of the applicants relates to their exam results.
Context The DataScienceBowl covered the whole process of diagnosing lung cancer, and I aim to make the individual steps clearer. After segmenting lungs and identifying suspicious nodes, it is important to classify them as malignant or benign. Content This dataset consists of several thousand examples formatted in multipage TIFF (for use with tools like ImageJ and KNIME) and HDF5 (for Python and R). Acknowledgements The data were preprocessed and extracted partially from the LUNA16 competition (https://luna16.grand-challenge.org/description/) and should be used with the same policy that data has. Inspiration The dataset is more for practice with medical images and CNNs, but it would be interesting to see how the best manually created features (HoG, SIFT, ...) perform against various Deep Learning approaches. It would also be quite interesting to try and visualize exactly which parts of an image made the algorithm guess malignant or benign.
Context NBA Draft history 2012-2017
This is a dataset from https://webhose.io/datasets/ containing company reviews. A portion of it is extracted to get a balanced number of positive and negative reviews as well as to reduce the size of the dataset.
Summary of the dataset with information on human subjects involved and indication of relevant file locations
Bug triage.
The compressed file contains the dataset and the source code for the paper: Prediction of Pathological Stage in Patients with Prostate Cancer: A Neuro-Fuzzy Model. Georgina Cosma, Giovanni Acampora, David Brown, Robert C. Rees, Masood Khan, A. Graham Pockley. PLoS ONE, 2016.
Context This dataset is a playground for fundamental and technical analysis. It is said that 30% of traffic on stocks is already generated by machines. Can trading be fully automated? If not, there is still a lot to learn from historical data.
Content The dataset consists of the following files:
- **prices.csv**: raw, as-is daily prices. Most of the data spans from 2010 to the end of 2016; for companies new to the stock market the date range is shorter. There were approx. 140 stock splits in that time, and this set does not account for them.
- **prices-split-adjusted.csv**: same as prices, but with adjustments for splits added.
- **securities.csv**: general description of each company, with division into sectors.
- **fundamentals.csv**: metrics extracted from annual SEC 10-K filings (2012-2016); should be enough to derive most of the popular fundamental indicators.
Acknowledgements Prices were fetched from Yahoo Finance; fundamentals are from Nasdaq Financials, extended by some fields from EDGAR SEC databases.
Inspiration Here are a couple of things one could try out with this data:
- One-day-ahead prediction: Rolling Linear Regression, ARIMA, Neural Networks, LSTM
- Momentum/Mean-Reversion Strategies
- Security clustering, portfolio construction/hedging
Which company has the biggest chance of going bankrupt? Which one is undervalued (how did prices behave afterwards)? What is the Return on Investment?
This dataset is for running the code from this site: https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8. This is how to show a picture from the training set: display(Image('../input/cat-and-dog/training_set/training_set/dogs/dog.423.jpg')) From the test set: display(Image('../input/cat-and-dog/test_set/test_set/cats/cat.4453.jpg')) See an example of using this dataset. https://www.kaggle.com/tongpython/nattawut-5920421014-cat-vs-dog-dl
Dataset of Nobel Prize winners for the article "A novel bibliometric index with a simple geometric interpretation", to be published in PLoS ONE, authored by Trevor Fenner, Martyn Harris, Mark Levene and Judit Bar-Ilan.
A new global surface temperature (ST) dataset, the China Merged Surface Temperature (CMST) dataset, has recently been developed. CMST is created by merging the China-Land Surface Air Temperature (C-LSAT) with the sea surface temperature (SST) data from the Extended Reconstructed Sea Surface Temperature version 5 (ERSSTv5).
Readme of the data files (CMST)
The CMST (China Merged Surface Temperature) is produced by merging data from the C-LSAT land surface air temperature dataset and the ERSSTv5 sea-surface temperature dataset.
All values are stored as temperature anomalies in degrees Celsius.
Missing data are set to the value -999.99.
Grids are 5x5 deg, monthly.
reference_period = [1961 1990]
Time: 190001-201812 (size: 1416)
Data array (36x72)
Item (1,1) stores the value for the 5-deg area centred at 0° and 87.5°S.
Item (36,72) stores the value for the 5-deg area centred at 360° and 87.5°N.
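As a hedged illustration of how one monthly grid in this layout might be handled, the sketch below masks the -999.99 missing value and computes a cosine-of-latitude weighted mean anomaly. It assumes rows run from 87.5°S to 87.5°N in 5-degree steps, as the readme's first and last items suggest. The area weighting and the function name are additions for illustration and are not part of the readme.

```python
import numpy as np

MISSING = -999.99  # missing-data value stated in the readme


def global_mean_anomaly(grid):
    """Area-weighted mean of one 36x72 anomaly grid, ignoring missing cells."""
    grid = np.asarray(grid, dtype=float)
    if grid.shape != (36, 72):
        raise ValueError("expected a 36x72 grid")
    data = np.where(np.isclose(grid, MISSING), np.nan, grid)
    # Assumed row order: latitude centres from 87.5 S to 87.5 N.
    lat_centres = -87.5 + 5.0 * np.arange(36)
    weights = np.cos(np.deg2rad(lat_centres))[:, None] * np.ones((1, 72))
    valid = ~np.isnan(data)
    if not valid.any():
        return float("nan")
    return float(np.sum(data[valid] * weights[valid]) / np.sum(weights[valid]))


# Synthetic month: a uniform 0.5 degree anomaly with a few missing cells.
month = np.full((36, 72), 0.5)
month[0, :10] = MISSING
result = global_mean_anomaly(month)
print(result)
```

A uniform field returns the same value regardless of which cells are missing, which makes it a convenient sanity check before applying the function to real data.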
The Twentieth Century Reanalysis Project, produced by the Earth System Research Laboratory Physical Sciences Division from NOAA and the University of Colorado Cooperative Institute for Research in Environmental Sciences using resources from Department of Energy supercomputers, is an effort to produce a global reanalysis dataset spanning a portion of the nineteenth century and the entire twentieth century (1836 - 2015), assimilating only surface observations of synoptic pressure into an 80-member ensemble of estimates of the Earth system. Boundary conditions of pentad sea surface temperature and monthly sea ice concentration and time-varying solar, volcanic, and carbon dioxide radiative forcings are prescribed. Products include 3 and 6-hourly ensemble mean and spread analysis fields and 6-hourly ensemble mean and spread forecast (first guess) fields on a global Gaussian T254 grid. Fields are accessible in yearly time series (1 file per parameter). The Twentieth Century Reanalysis Version 3 uses the NCEP Global Forecast Model that was operational in autumn 2017, with differences as described in (Slivinski et al. 2019). Sea ice boundary conditions are specified from HadISST 2.3 (Slivinski et al. 2019). Sea surface temperature fields prior to 1981 are prescribed from the 8-member ensemble of pentad Simple Ocean Data Assimilation with sparse input (SODAsi.3, Giese et al. 2016) and from the 8-member ensemble of pentad HadISST 2.2 for 1981 to 2015. Observations from ISPD version 4.7 are assimilated using an ensemble Kalman filter. The Twentieth Century Reanalysis Project version 3 used resources of the National Energy Research Scientific Computing Center managed by Lawrence Berkeley National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Version 3 is a contribution to the international Atmospheric Circulation Reconstructions over the Earth initiative. 
Support for the Twentieth Century Reanalysis Project is provided by the Physical Sciences Division of the NOAA Earth System Research Laboratory, the U.S. Department of Energy Office of Science (BER), and the NOAA Climate Program Office MAPP program.
Raw data for the Figurnoye Lake pollen dataset obtained from the Neotoma Paleoecological Database.
Context The WVS consists of nationally representative surveys conducted in almost 100 countries which contain almost 90 percent of the world’s population, using a common questionnaire. The WVS is the largest non-commercial, cross-national, time series investigation of human beliefs and values ever executed, currently including interviews with almost 400,000 respondents. Content The World Values Survey data grouped by country and wave. Each question code is matched with the subgroup mean if the question is numeric, and with the mode otherwise. The standard deviation of answers within each subgroup is given in columns named with the question code plus the suffix '_SD'. The attached Code File links the variables to their original questionnaire content, including the possible responses. All negative, and thus missing, responses have been marked as NA. Acknowledgements The entire dataset has been created and is maintained by the World Values Survey organisation. Find the entire dataset at [their official website][1]. Please note the following disclaimer: These data files are available without restrictions, provided a) that they are used for non-profit purposes; and b) correct citations are provided and sent to the World Values Survey Association for each publication of results based in part or entirely on these data files. This citation will be made freely available; and c) the data files themselves are not redistributed. Inspiration [Quote:][2] The WVS seeks to help scientists and policy makers understand changes in the beliefs, values and motivations of people throughout the world. Thousands of political scientists, sociologists, social psychologists, anthropologists and economists have used these data to analyze such topics as economic development, democratization, religion, gender equality, social capital, and subjective well-being.
These data have also been widely used by government officials, journalists and students, and groups at the World Bank have analyzed the linkages between cultural factors and economic development. [1]: http://www.worldvaluessurvey.org/WVSDocumentationWVL.jsp [2]: http://www.worldvaluessurvey.org/WVSContents.jsp
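The aggregation described under Content (mean for numeric questions, mode otherwise, standard deviation in `_SD` columns, negative codes treated as missing) could be reproduced roughly as below. This is a sketch under assumptions: the question codes `Q1` and `Q2` and the toy rows are made up, and whether the published `_SD` columns use the sample or the population standard deviation is not stated. pandas' default (sample) is used here.

```python
import pandas as pd


def summarise_by_group(df, keys):
    """Per-group mean and SD for numeric columns, mode for the others."""
    group_keys = [df[k] for k in keys]
    value_cols = [c for c in df.columns if c not in keys]
    pieces = {}
    for col in value_cols:
        series = df[col]
        if pd.api.types.is_numeric_dtype(series):
            cleaned = series.where(series >= 0)  # negative codes become NaN
            grouped = cleaned.groupby(group_keys)
            pieces[col] = grouped.mean()
            pieces[col + "_SD"] = grouped.std()
        else:
            grouped = series.groupby(group_keys)
            pieces[col] = grouped.agg(lambda s: s.mode().iloc[0])
    return pd.DataFrame(pieces).reset_index()


# Made-up respondent-level rows with hypothetical question codes.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "wave": [1, 1, 1, 1, 1],
    "Q1": [1, 3, -1, 2, 4],
    "Q2": ["yes", "yes", "no", "no", "no"],
})

summary = summarise_by_group(toy, ["country", "wave"])
print(summary)
```

In the toy data, the `-1` answer in country A is dropped before averaging, so that group's mean for `Q1` is taken over two responses only.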
Seed dispersal distance dataset for the Ecology Letters paper
Context ECG data from mit-bih database from physionet Content Raw signals in .csv files and original annotations in .txt. Acknowledgements https://www.physionet.org/physiobank/database/mitdb/
**Context-** This is a dataset containing records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. **Content-** This dataset has 260,760 rows and 17 columns. - INCIDENT_NUMBER: - OFFENSE_CODE: - OFFENSE_CODE_GROUP: - OFFENSE_DESCRIPTION: - DISTRICT: - REPORTING_AREA: - SHOOTING: - OCCURRED_ON_DATE: - YEAR: - MONTH: - DAY_OF_WEEK: - HOUR: - UCR_PART: - STREET: - LATITUDE: - LONGITUDE: - LOCATION: **Acknowledgements-** I would like to thank the Boston Police Department for making this dataset available to everyone. **Inspiration** 1. How has crime changed over the years? 2. Is it possible to predict where or when a crime will be committed? 3. Which areas of the city have evolved over this time span? 4. In which area are most crimes committed?
This dataset is a sub dataset of the Yelp Challenge.
This dataset contains the data collected for the evaluation of RASH (Research Articles in Simplified HTML), which is presented in the paper 'Research Articles in Simplified HTML: a Web-first format for HTML-based scholarly articles'. In particular, it includes: 1) Four CSV files reporting the surveys filled in by authors and reviewers of RASH papers published in the SAVE-SD 2015 and SAVE-SD 2016 workshops: - survey_authors_2015 - survey_authors_2016 - survey_reviewers_2015 - survey_reviewers_2016 2) Four CSV files presenting the more frequent entities and vocabularies included in the RASH papers published in SAVE-SD 2015 and SAVE-SD 2016 workshops: - entities_analysis_2015.csv - entities_analysis_2016.csv - vocabularies_analysis_2015.csv - vocabularies_analysis_2016.csv For any question about the data please contact francesco.osborne@open.ac.uk or silvio.peroni@unibo.it
crystallization dataset
On August 16, 2018 Aretha Franklin died in Detroit, Michigan at the age of 76. Franklin, also known as the Queen of Soul, had an award winning career as a singer, songwriter, actress and pianist while also being described as the voice of the civil rights movement. This item contains two tweet id datasets. The first was collected from the search API during the response to the announcement of her death, which includes tweets from August 8 - August 19 using the query "Aretha Franklin" OR "Queen of Soul". The second dataset was collected over August 24 to September 3, which includes the date of her funeral on August 31. This second dataset was collected from the search API using the query "Aretha Franklin" OR "Queen of Soul" OR ArethaHomegoing OR ArethaFranklinFuneral OR ArethaFranklin which includes hashtags that were trending at the time. The datasets contain 2,832,128 and 1,332,442 tweet identifiers respectively.
Datasets for all the figures included in the paper - 'Non-Gaussian distribution of collective operators in quantum spin chains', G. De Chiara et al. New J. Phys. 2018.
The human cerebral cortex, whether traced through phylogeny or ontogeny, emerges through expansion and progressive differentiation into larger and more diverse areas. While current methodologies address this analytically by characterizing local cortical expansion in the form of surface area, several lines of research have proposed that the cortex in fact expands along trajectories from primordial anchor areas and, furthermore, that the distance along the cortical surface is informative regarding cortical differentiation. We sought to investigate the geometric relationships that arise in the cortex based on expansion from such origin points. Towards this aim, we developed a Python package for measuring the geodesic distance along the cortical surface that restricts shortest paths from passing through nodes of non-cortical areas such as the non-cortical portions of the surface mesh described as the “medial wall”. The calculation of geodesic distance along a mesh surface is based on the cumulative distance of the shortest path between two points. The first challenge that arises is the sensitivity of the calculation to the resolution of the mesh: the coarser the mesh, the longer the shortest path may be, as the distance becomes progressively less direct. This problem has been previously addressed and subsequently implemented in the Python package gdist, which calculates the exact geodesic distance along a mesh by subdividing the shortest path until a straight line along the cortex is approximated. The second challenge, for which there was no prefabricated solution, was ensuring that the shortest path only traverses territory within the cortex proper, avoiding shortcuts through non-cortical areas included in the surface mesh, most prominently the non-cortical portions along the medial wall. 
Were the shortest paths between two nodes to traverse non-cortical regions, the distance between nodes would be artificially decreased, which would have an artifactual impact on the interpretation of results. This concern would be especially relevant to the ‘zones analysis’ described below, where the boundaries between regions would be altered. It was therefore necessary to remove mesh nodes prior to calculating the exact geodesic distance, which requires reconstructing the mesh and assigning the respective new node indices for any seed regions-of-interest. Finally, to facilitate applications to neuroscience research questions, we enabled the loading and visualization of data from commonly used formats such as FreeSurfer and the Human Connectome Project (HCP). A Nipype pipeline for group-level batch processing has also been made available. The pipeline is wrapped in a command-line interface and allows for straightforward distance calculations of entire FreeSurfer-preprocessed datasets. Group-level data are stored as CSV files for each requested mesh resolution, source label and hemisphere, facilitating further statistical analyses. The resultant package, SurfDist, achieves the aforementioned goals of facilitating the calculation of exact geodesic distance on the cortical surface. We present here the distance measures from the central and calcarine sulci labels on the FreeSurfer native surfaces. The distance measure provides a means to parcellate the cortex using the surface geometry. Towards that aim, we also implement a ‘zones analysis’, which constructs a Voronoi diagram, establishing partitions based on the greater proximity to a set of label nodes. The SurfDist package is designed to enable investigation of intrinsic geometric properties of the cerebral cortex based on geodesic distance measures. 
Towards the aim of enabling applications specific to neuroimaging-based research questions, we have designed the package to facilitate analysis and visualization of geodesic distance metrics using standard cortical surface meshes.
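The node-removal, re-indexing and 'zones analysis' steps described above can be illustrated with a small sketch. It is not the SurfDist implementation: all function names are hypothetical, and it swaps in Dijkstra's algorithm along mesh edges, which only approximates (and tends to overestimate) the exact geodesic distance that gdist computes.

```python
import heapq
import math


def remove_nodes(vertices, triangles, cortex):
    """Keep only cortical vertices and re-index the surviving triangles."""
    keep = sorted(cortex)
    new_index = {old: new for new, old in enumerate(keep)}
    new_vertices = [vertices[old] for old in keep]
    new_triangles = [
        tuple(new_index[v] for v in tri)
        for tri in triangles
        if all(v in new_index for v in tri)
    ]
    return new_vertices, new_triangles, new_index


def edge_graph(vertices, triangles):
    """Weighted adjacency built from triangle edges (Euclidean lengths)."""
    graph = {i: {} for i in range(len(vertices))}
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            graph[u][v] = graph[v][u] = math.dist(vertices[u], vertices[v])
    return graph


def distances_from(graph, seeds):
    """Multi-source Dijkstra: edge-path distance to the nearest seed node."""
    dist = {n: math.inf for n in graph}
    heap = []
    for s in seeds:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def zones(graph, labels):
    """Voronoi-style partition: assign each node to its nearest label."""
    per_label = {name: distances_from(graph, s) for name, s in labels.items()}
    return {n: min(per_label, key=lambda name: per_label[name][n])
            for n in graph}


# Toy mesh: node 0 plays the "medial wall", nodes 1-4 and 5-8 form a ring.
vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0),
            (2, 0, 0), (0, 2, 0), (-2, 0, 0), (0, -2, 0)]
triangles = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1),
             (1, 5, 2), (5, 6, 2), (2, 6, 3), (6, 7, 3),
             (3, 7, 4), (7, 8, 4), (4, 8, 1), (8, 5, 1)]

full = distances_from(edge_graph(vertices, triangles), [1])[3]
v2, t2, idx = remove_nodes(vertices, triangles, cortex=range(1, 9))
masked_graph = edge_graph(v2, t2)
masked = distances_from(masked_graph, [idx[1]])[idx[3]]
parcels = zones(masked_graph, {"a": [idx[1]], "b": [idx[3]]})
```

On this toy mesh the path from node 1 to node 3 has length 2 when it may cut through node 0, and about 2.83 once node 0 is removed, which is the artificial shortening the text warns about. Seed indices have to be translated through `idx` because removing nodes shifts every later index.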
Final dataset used to generate results of this publication. Note that you must take the medians for each taxon to reproduce the results.
NCEP ADP ETA / NAM Upper Air Observation Subsets are composed of a regional synoptic set of upper air reports centered over North America, operationally collected by the National Centers for Environmental Prediction (NCEP). These include radiosondes, pibals and aircraft reports from the Global Telecommunications System (GTS) and satellite data from the National Environmental Satellite Data and Information Service (NESDIS). The reports can include pressure, geopotential height, temperature, dew point depression, wind direction and speed. Data may be available at up to 20 mandatory levels from 1000 millibars to 1 millibar, plus a few significant levels. Report time intervals range from 3 hourly to 12 hourly. These data are the primary input to the EDAS / NAM Data Assimilation System (NDAS starting January 23, 2005). DS351.0 [https://rda.ucar.edu/datasets/ds351.0/] provides global data coverage over the same time period. This data set is no longer updated. If you have a need for North American data that is not met by DS351.0 NCEP ADP Global Upper Air Observational Weather Data, October 1999 - continuing [https://rda.ucar.edu/datasets/ds351.0/], contact the RDA for alternatives.
This dataset contains the International Surface Pressure Databank version 3.2.9 (ISPDv3), the world's largest collection of pressure observations. It has been gathered through international cooperation with data recovery facilitated by the ACRE Initiative and the other contributing organizations and assembled under the auspices of the GCOS Working Group on Surface Pressure and the WCRP/GCOS Working Group on Observational Data Sets for Reanalysis by NOAA Earth System Research Laboratory (ESRL), NOAA's National Climatic Data Center (NCDC), and the University of Colorado's Cooperative Institute for Research in Environmental Sciences (CIRES). The ISPDv3 consists of three components: station, marine, and tropical cyclone best track pressure observations. The station component is a blend of many national and international collections. In addition to the pressure observations and metadata, ISPDv3 contains feedback from the 20th Century Reanalysis version 2c (20CRV2c), including quality control information and uncertainty information. Support for the International Surface Pressure Databank is provided by the U.S. Department of Energy, Office of Science Biological and Environmental Research (BER), and by the National Oceanic and Atmospheric Administration Climate Program Office. The International Surface Pressure Databank version 3 and 20th Century Reanalysis version 2c used resources of the National Energy Research Scientific Computing Center which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
This is a sample log of HDFS dataset. Please visit our project page for the full set of system logs: https://github.com/logpai/loghub
This dataset is a compilation of observational measurements on comet morphology and magnitude covering just over a thousand apparitions of various comets from (UTC) -466 to 1975.
The data set contains combined Dynamic Ocean Topography (DOT) and geostrophic velocity components for the northern Nordic Seas between 1995 and 2012. It was produced in the frame of the DFG project NEG-OCEAN: Variations in ocean currents, sea-ice concentration, and sea surface temperature along the North-East coast of Greenland. The data is provided as Format 4 Classic NetCDF files on an unstructured triangular, Finite Element formulated grid. The data are characterized by daily sampling between 18.5.1995 and 3.4.2012 including data gaps and a consistent spatial resolution up to 1 km. More details can be found in the related User Manual. The dataset is based on Dynamic Ocean Topography (DOT) elevations from a combination of along-track satellite altimetry measurements with simulated differential water heights from the Finite Element Sea-ice Ocean Model Version 1.4 (FESOM, Wekerle et al., 2017, doi:10.1002/2017JC012974). The combination approach is described in detail in the related publication. The altimetry data include observations of the ESA satellites Envisat and ERS-2. The high-frequent altimetry range observations are retracked using the ALES+ algorithm (Passaro et al., 2018, doi:10.1016/j.rse.2018.02.074) and are classified into open-water/sea-ice conditions by applying a classification algorithm (Müller et al., 2017, doi:10.3390/rs9060551). All applied atmospheric and geophysical altimetry corrections are listed in Müller et al., 2019 (doi:10.5194/tc-13-611-2019).
Cumulative fitness (times flowered + 1 if the plant survived to the end of the experiment) in crosses within the campions Silene dioica (L.) Clairv. and S. latifolia Poiret, as well as their first- and second-generation hybrids, in a four-year transplant experiment at three sites of each species (data from Favre et al., 2017, New Phytologist 213, 1487-1499). Individuals that died soon after transplantation (transplant shock) or due to mole disturbance or could not be sexed were excluded from the dataset here (see Favre et al. 2017). Column names: Site.ID, identification of individuals across sites; site, transplant site; ID, identification number of individuals within sites; cross, cross type [SD: within S. dioica, SL: within S. latifolia, HD: SD female x SL male, HL: SL female x SD male, F2: second-generation hybrids (pooled)]; habitat, SD: transplanted within SD population, SL: transplanted within SL population; dblock, block within site; family, full-sib family; sex, plant sex; fitness.rev, cumulative fitness (times flowered + 1 if the plant survived to the end of the experiment).
Datasets for "The Accounting Network: how financial institutions react to systemic crisis"
Female survival. Observe that the territory identity does not match other datasets.
The database contains fasta sequences from UniProt and associated metadata for molluscan shell matrix proteins (SMPs). The database only contains SMPs that have been experimentally validated to be present in molluscan shell matrices (based on the publication(s) attached to the UniProtID). Metadata includes information on functional domains present in the sequence, as detected by InterproScan. With the advent of Next Generation Sequencing technologies, it is computationally resource intensive to run sequence similarity algorithms on all published data. Moreover, it is impractical to sort through hundreds of sequence similarity search results when working with non-model organisms, since pre-established functional annotations of sequences are generally not available. Therefore, this database was created in order to provide a targeted molluscan biomineralization dataset for sequence similarity algorithms (such as BLAST).
Context This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the dataset you'll find information about businesses across 11 metropolitan areas in four countries. Content This dataset contains seven CSV files. The original JSON files can be found in yelp_academic_dataset.zip. You may find this documentation helpful: [https://www.yelp.com/dataset/documentation/json][1] In total, there are : - 5,200,000 user reviews - Information on 174,000 businesses - The data spans 11 metropolitan areas Acknowledgements The dataset was converted from JSON to CSV format and we thank the team of the Yelp dataset challenge for creating this dataset. By downloading this dataset, you agree to the [Yelp Dataset Terms of Use][2]. Inspiration Natural Language Processing & Sentiment Analysis What's in a review? Is it positive or negative? Yelp's reviews contain a lot of metadata that can be mined and used to infer meaning, business attributes, and sentiment. Graph Mining We recently launched our Local Graph but can you take the graph further? How do user's relationships define their usage patterns? Where are the trend setters eating before it becomes popular? [1]: https://www.yelp.com/dataset/documentation/json [2]: https://s3-media2.fl.yelpcdn.com/assets/srv0/engineering_pages/af4b9cebfb4f/assets/vendor/dataset-challenge-dataset-agreement.pdf
Forest plot showing the impact on overall survival of CDR2 expression in different public transcriptomic datasets, after adjusting for debulking surgery (residual tumor <1cm) and FIGO stage.
Raw data for the Toothaker Pond pollen surface sample dataset obtained from the Neotoma Paleoecological Database.
This dataset provides the data collected for a trial investigating the role of hydration status on glycaemic regulation in healthy adults (n = 16; n = 8 male). To our knowledge, the effect of hydration status on glycemia has never been causally investigated in healthy adults. Therefore, the goal was to explore how acute hypohydration impacts blood sugar control in healthy adults. The trial was a randomised crossover trial, with each trial arm lasting 5 days. The first 3 days were lifestyle monitoring, day 4 was a dehydration/rehydration day (including lifestyle monitoring), and day 5 was the full trial day. The trial arms were hypohydrated (HYPO), or rehydrated (RE).
The guideline to use the dataset.
LC-MS/MS water concentrations (EE2, DEET, diphenhydramine, fluoxetine and their mixture) dataset for the study: Untargeted Metabolomic Investigation of the Eastern Oyster
Context The Global Shark Attack File contains a global log of all reported shark attacks from ~1700s to 2018. Content Each record, or shark attack, includes the location, date, time of attack, shark species, and other information about the incident. Acknowledgements Shark Research Institute International Shark Attack File http://www.sharkattackfile.net/incidentlog.htm Inspiration I am interested in exploring if the number of shark attacks is rising as human populations and temperatures are increasing.
This is a car sales data set taken from Analytixlabs. It is a continuous data set that includes several predictors. The task is to predict car sales using machine learning techniques. So let's work on this dataset together and find out which machine learning technique is best suited for the prediction. I am using R, but you can use any language for this. So go for it and enjoy. If you have any questions about the data, feel free to post them.
Context Eclipses of the sun can only occur when the moon is near one of its two orbital nodes during the new moon phase. It is then possible for the Moon's penumbral, umbral, or antumbral shadows to sweep across Earth's surface thereby producing an eclipse. There are four types of solar eclipses: a partial eclipse, during which the moon's penumbral shadow traverses Earth and umbral and antumbral shadows completely miss Earth; an annular eclipse, during which the moon's antumbral shadow traverses Earth but does not completely cover the sun; a total eclipse, during which the moon's umbral shadow traverses Earth and completely covers the sun; and a hybrid eclipse, during which the moon's umbral and antumbral shadows traverse Earth and annular and total eclipses are visible in different locations. Earth will experience 11898 solar eclipses during the five millennium period -1999 to +3000 (2000 BCE to 3000 CE). Eclipses of the moon can occur when the moon is near one of its two orbital nodes during the full moon phase. It is then possible for the moon to pass through Earth's penumbral or umbral shadows thereby producing an eclipse. There are three types of lunar eclipses: a penumbral eclipse, during which the moon traverses Earth's penumbral shadow but misses its umbral shadow; a partial eclipse, during which the moon traverses Earth's penumbral and umbral shadows; and a total eclipse, during which the moon traverses Earth's penumbral and umbral shadows and passes completely into Earth's umbra. Earth will experience 12064 lunar eclipses during the five millennium period -1999 to +3000 (2000 BCE to 3000 CE). Acknowledgements Lunar eclipse predictions were produced by Fred Espenak from NASA's Goddard Space Flight Center.
Images from Imagenet dataset (http://image-net.org/) resized to 32x32
Dataset connected with article: 'Increasing the use of conceptually-derived strategies in arithmetic: using inversion problems to promote the use of associativity shortcuts.' Abstract: Conceptual knowledge of key principles underlying arithmetic is an important precursor to understanding algebra and later success in mathematics. One such principle is associativity, which allows individuals to solve problems in different ways by decomposing and recombining subexpressions (e.g. ‘a + b – c’ = ‘b – c + a’). More than any other principle, children and adults alike have difficulty understanding it, and educators have called for this to change. We report three intervention studies that were conducted in university classrooms to investigate whether adults’ use of associativity could be improved. In all three studies, it was found that those who first solved inversion problems (e.g. ‘a + b – b’) were more likely than controls to then use associativity on ‘a + b – c’ problems. We suggest that ‘a + b – b’ inversion problems may either direct spatial attention to the location of ‘b – c’ on associativity problems, or implicitly communicate the validity and efficiency of a right-to-left strategy. These findings may be helpful for those designing brief activities that aim to aid the understanding of arithmetic principles and algebra.
This dataset contains key characteristics about the data described in the Data Descriptor The sequence and de novo assembly of Takifugu bimaculatus genome using PacBio and Hi-C technologies. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format 3. machine readable metadata file in ISA-Tab format (zipped folder)
Raw data for the Fisherman Lake pollen surface sample dataset obtained from the Neotoma Paleoecological Database.
Context This dataset is a collection of newsgroup documents. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. Content There is a file (list.csv) that contains a reference to the document_id number and the newsgroup it is associated with. There are also 20 files that contain all of the documents, one file per newsgroup. In this dataset, duplicate messages have been removed and the original messages only contain "From" and "Subject" headers (18828 messages total). Each new message in the bundled file begins with these four headers: Newsgroup: alt.newsgroup Document_id: xxxxxx From: Cat Subject: Meow Meow Meow The Newsgroup and Document_id can be referenced against list.csv Organization - Each newsgroup file in the bundle represents a single newsgroup - Each message in a file is the text of some newsgroup document that was posted to that newsgroup. This is a list of the 20 newsgroups: - comp.graphics - comp.os.ms-windows.misc - comp.sys.ibm.pc.hardware - comp.sys.mac.hardware - comp.windows.x - rec.autos - rec.motorcycles - rec.sport.baseball - rec.sport.hockey - sci.crypt - sci.electronics - sci.med - sci.space - misc.forsale - talk.politics.misc - talk.politics.guns - talk.politics.mideast - talk.religion.misc - alt.atheism - soc.religion.christian Acknowledgements Ken Lang is credited by the source for collecting this data. The source of the data files is here: http://qwone.com/~jason/20Newsgroups/ Inspiration - This dataset text can be used to classify text documents
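As a rough illustration of the four-header layout described above, the following sketch splits one bundled newsgroup file into messages. It assumes each header sits on its own line and that a line starting with "Newsgroup: " always opens a new message; neither detail is confirmed by the description, and the function name is hypothetical.

```python
HEADERS = ("Newsgroup", "Document_id", "From", "Subject")


def split_messages(text):
    """Split one bundled newsgroup file into a list of message dicts."""
    messages, current = [], None
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and name == "Newsgroup":
            # Assumption: a "Newsgroup:" line always opens a new message.
            current = {"Newsgroup": value.strip(), "body": []}
            messages.append(current)
        elif current is None:
            continue  # skip anything before the first header
        elif sep and name in HEADERS and name not in current and not current["body"]:
            current[name] = value.strip()  # remaining headers, before the body
        else:
            current["body"].append(line)
    for message in messages:
        message["body"] = "\n".join(message["body"]).strip()
    return messages


sample = (
    "Newsgroup: sci.space\nDocument_id: 1\nFrom: Cat\n"
    "Subject: Meow Meow Meow\n\nFirst body.\n"
    "Newsgroup: sci.space\nDocument_id: 2\nFrom: Dog\n"
    "Subject: Woof\n\nSecond body.\n"
)
parsed = split_messages(sample)
```

The `Newsgroup` and `Document_id` values of each parsed message can then be matched against the rows of list.csv. A message body that itself contains a line beginning with "Newsgroup: " would be split incorrectly by this sketch.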
I have begun studying how to apply R to the field of market research, primarily through the use of this book: https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3319144359 The book instructs me to build a lot of the models from scratch to get a better appreciation of how R works, and this dataset serves as a portfolio of all that I have learned thus far.
1. Title: Postoperative Patient Data 2. Source Information: -- Creators: Sharon Summers, School of Nursing, University of Kansas Medical Center, Kansas City, KS 66160 Linda Woolery, School of Nursing, University of Missouri, Columbia, MO 65211 -- Donor: Jerzy W. Grzymala-Busse (jerzy@cs.ukans.edu) (913)864-4488 -- Date: June 1993 3. Past Usage: 1. A. Budihardjo, J. Grzymala-Busse, L. Woolery (1991). Program LERS_LB 2.5 as a tool for knowledge acquisition in nursing, Proceedings of the 4th Int. Conference on Industrial & Engineering Applications of AI & Expert Systems, pp. 735-740. 2. L. Woolery, J. Grzymala-Busse, S. Summers, A. Budihardjo (1991). The use of machine learning program LERS_LB 2.5 in knowledge acquisition for expert system development in nursing. Computers in Nursing 9, pp. 227-234. 4. Relevant Information: The classification task of this database is to determine where patients in a postoperative recovery area should be sent to next. Because hypothermia is a significant concern after surgery (Woolery, L. et al. 1991), the attributes correspond roughly to body temperature measurements. Results: -- LERS (LEM2): 48% accuracy 5. Number of Instances: 90 6. Number of Attributes: 9 including the decision (class attribute) 7. Attribute Information: 1. L-CORE (patient's internal temperature in C): high (> 37), mid (>= 36 and <= 37), low (< 36) 2. L-SURF (patient's surface temperature in C): high (> 36.5), mid (>= 35 and <= 36.5), low (< 35) 3. L-O2 (oxygen saturation in %): excellent (>= 98), good (>= 90 and < 98), fair (>= 80 and < 90), poor (< 80) 4. L-BP (last measurement of blood pressure): high (> 130/90), mid (<= 130/90 and >= 90/70), low (< 90/70) 5. SURF-STBL (stability of patient's surface temperature): stable, mod-stable, unstable 6. CORE-STBL (stability of patient's core temperature): stable, mod-stable, unstable 7. BP-STBL (stability of patient's blood pressure): stable, mod-stable, unstable 8. 
COMFORT (patient's perceived comfort at discharge, measured as an integer between 0 and 20) 9. decision ADM-DECS (discharge decision): I (patient sent to Intensive Care Unit), S (patient prepared to go home), A (patient sent to general hospital floor) 8. Missing Attribute Values: Attribute 8 has 3 missing values 9. Class Distribution: I (2) S (24) A (64)
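The attribute thresholds and class counts listed above can be turned into a small worked example. The function names are hypothetical; the cut-offs for L-CORE and L-O2 and the class distribution are taken directly from the description. The last lines compute the accuracy of always predicting the most frequent class, as a point of comparison for the reported 48% LERS accuracy.

```python
def l_core(temp_c):
    """Discretise core temperature (C) using the listed L-CORE cut-offs."""
    if temp_c > 37:
        return "high"
    if temp_c >= 36:
        return "mid"
    return "low"


def l_o2(saturation_pct):
    """Discretise oxygen saturation (%) using the listed L-O2 cut-offs."""
    if saturation_pct >= 98:
        return "excellent"
    if saturation_pct >= 90:
        return "good"
    if saturation_pct >= 80:
        return "fair"
    return "poor"


# Class distribution as listed: I (2), S (24), A (64).
class_counts = {"I": 2, "S": 24, "A": 64}
n_instances = sum(class_counts.values())

# Accuracy of a classifier that always predicts the majority class "A".
majority_baseline = max(class_counts.values()) / n_instances
```

The counts add up to the stated 90 instances, and the majority-class baseline comes to 64/90, roughly 71%.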
SET 1: The dataset is composed of a set of omnidirectional images captured in an indoor environment (Quorum V building, ground floor, ARVC Laboratory) at Miguel Hernández University, Spain. This database is intended to test visual mapping and localization algorithms for mobile robots. The images have been captured using an Imaging Source DFK 21BF04 camera, which takes pictures of a hyperbolic mirror (Eizoh Wide 70). The mirror is mounted over the camera, with its axis aligned with the camera optic axis. The whole database contains 400 images that were captured while the robot went through a previously defined trajectory in a laboratory area. The distance between each pair of consecutive images is equal to 20 cm and the environment where the images were captured is very prone to visual aliasing (the visual appearance of some images captured in different rooms may be very similar). SET 2: Set of omnidirectional images captured in an indoor environment (Quorum V building, 2nd floor) at Miguel Hernández University. The database includes a corridor, three offices, a library and an events room. It is composed of 872 omnidirectional colour images which have been captured on a dense regular 40x40 cm grid of points. A bird's-eye view of the grid points is included. The database was captured with an Imaging Source DFK 21BF04 camera, which takes pictures of a hyperbolic mirror (Eizoh Wide 70). The mirror is mounted over the camera, with its axis aligned with the camera optic axis.
Fasta file containing alignment of sequences used to calculate observed summary statistics. So called ABC dataset in the publication.
ddPCR dataset
Subset of the "iCubWorld Transformations" dataset (https://robotology.github.io/iCubWorld/) to be used for the Deep Learning hands-on session at the Winter School on Humanoid Robot Programming (http://www.icub.org/winterschool/).
Supplemental Data. Walker et al. (2017). Plant Cell 10.1105/tpc.16.00961. Supplemental Dataset 1. RMA-normalised Nimblegen microarray data for all transcripts measured. The table lists the TAIR10 Arabidopsis Genome Initiative (AGI) gene IDs represented on the array (for design see GEO record GPL18735 for Nimblegen probe design) and their expression values in all 6 time series, averaged (“Mean”) for each replicate set; see GEO GSE91379 for complete raw and normalised individual replicate values. Gene symbols and gene descriptions are listed according to the TAIR10 annotation. If significantly differentially expressed within a time series (BATS), the cluster number is listed. If significantly differentially expressed between treated and untreated time series (CN GP2S etc), the cluster number is listed. Cluster numbers described in the manuscript text always refer to the within-CU/PU or N/Rhizobia vs. U clusters (orange columns). Empty cells indicate no evidence of DE within/between time series. Transcripts associated with genes that have previously been found to be affected by the protoplast generation treatment or FACS (as [18]) are marked (Proto-flagged) but not removed from the analysis.
R environment file for 1000 simulated datasets under BM, and 1000 simulated datasets under BM with a trend (µ = 3).
The GIS database contains the data of aufeis (naleds) in the Indigirka River basin (Russia) from historical and present-day sources, and complete ArcGIS 10.1/10.2 and QGIS 3* projects to view and analyze the data. All data and projects use the WGS 1984 coordinate system (without projection). The ArcGIS and QGIS projects contain two layers: Aufeis_kadastr (historical aufeis data collection, point objects) and Aufeis_Landsat (satellite-derived aufeis data collection, polygon objects). The historical data collection is based on the Cadastre of aufeis (naleds) of the North-East of the USSR (1958). Each aufeis was digitized as a point feature from the inventory map (scale 1:2 000 000), or from topographic maps. Attribute data were obtained from the Cadastre of aufeis. According to the historical data, there were 896 aufeis with a total area of 2063.6 km² within the studied basin. The present-day aufeis dataset was created from Landsat-8 OLI images for the period 2013-2017. Each aufeis was delineated from satellite images as a polygon. Cloud-free Landsat images were obtained immediately after the snowmelt season (e.g. between May 15 and June 18), to detect the highest possible number of aufeis. Critical values of the Normalized Difference Snow Index (NDSI) were used for semi-automated aufeis detection. However, a detailed expert-based verification was performed after the automated procedure, to distinguish snow-covered areas from aufeis and to cross-reference the historical and satellite-based data collections. According to Landsat data, the number of aufeis reaches 1213, with a total area of about 1287 km². The difference between the Cadastre (1958) and the satellite-derived data may indicate significant changes in aufeis formation environments.
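The NDSI thresholding step mentioned above can be sketched as follows. NDSI is the normalised difference of green and shortwave-infrared reflectance, which for Landsat-8 OLI corresponds to bands 3 and 6. The threshold of 0.4 is only a commonly used snow and ice default and is an assumption here, since the description does not state the critical values that were applied; the function names are likewise hypothetical.

```python
def ndsi(green, swir1):
    """Normalized Difference Snow Index for one pixel (reflectance values)."""
    total = green + swir1
    if total == 0:
        return None  # undefined where both bands are zero (e.g. no data)
    return (green - swir1) / total


def ice_mask(green_band, swir1_band, threshold=0.4):
    """Flag candidate snow/ice pixels; threshold=0.4 is an assumed default."""
    mask = []
    for green_row, swir_row in zip(green_band, swir1_band):
        row = []
        for g, s in zip(green_row, swir_row):
            value = ndsi(g, s)
            row.append(value is not None and value >= threshold)
        mask.append(row)
    return mask


# Two hypothetical pixels: bright ice and bare ground.
green = [[0.8, 0.3]]
swir1 = [[0.2, 0.3]]
candidates = ice_mask(green, swir1)
```

Pixels flagged this way are only candidates: as the description notes, snow and aufeis both score high on NDSI, which is why expert verification followed the automated step.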
Excel spreadsheet with the complete diet and body mass (from Dunning 2008) dataset, including associated metadata, listed by species, season and locality.
310 Observations, 13 Attributes (12 Numeric Predictors, 1 Binary Class Attribute - No Demographics) Lower back pain can be caused by a variety of problems with any part of the complex, interconnected network of spinal muscles, nerves, bones, discs or tendons in the lumbar spine. Typical sources of low back pain include: - The large nerve roots in the low back that go to the legs may be irritated - The smaller nerves that supply the low back may be irritated - The large paired lower back muscles (erector spinae) may be strained - The bones, ligaments or joints may be damaged - An intervertebral disc may be degenerating An irritation or problem with any of these structures can cause lower back pain and/or pain that radiates or is referred to other parts of the body. Many lower back problems also cause back muscle spasms, which don't sound like much but can cause severe pain and disability. While lower back pain is extremely common, the symptoms and severity of lower back pain vary greatly. A simple lower back muscle strain might be excruciating enough to necessitate an emergency room visit, while a degenerating disc might cause only mild, intermittent discomfort. This data set is intended for identifying whether a person is abnormal or normal using collected physical spine details/data.
This is a piece of bike sharing system data analytics work using the CRISP-DM data mining process.
Context Twitter gives the general public unfiltered, direct access to the ideas and policies of politicians. Understanding the content and reach of these tweets can therefore help us understand what connects with constituents. This dataset is meant to help with that exploration. By applying sentiment analysis (using an already trained system) we can attach sentiment context to these tweets. This will help us understand who responds to positive and negative content. Finally, this analysis may help to identify fake or hyperbolic, polarized Twitter users. Content The dataset contains two files, both in .csv format. The first is a list of the political parties and the representatives' handles, and the second contains the 200 latest tweets as of May 2018 from those Twitter users. Acknowledgements I would like to thank the following website and people who helped me get started Inspiration I was first inspired by trying to find out whether the average person would be able to distinguish between political tweets if no context was given. I made a small website that you can try this on. I will use real user data to cross-check and see if ML methods are actually better than the average person. Other possible uses include the following: Can we use this to detect Russian troll Twitter accounts? Do people respond to negative or positive political tweets?
Context The need for music-speech classification is evident in many audio processing tasks which relate to real-life materials such as archives of field recordings, broadcasts and any other contexts which are likely to involve speech and music, concurrent or alternating. Segregating the signal into speech and music segments is an obvious first step before applying speech-specific or music-specific algorithms. Indeed, speech-music classification has received considerable attention from the research community (for a partial list, see references below) but many of the published algorithms are dataset-specific and are not directly comparable due to non-standardised evaluation. Content Dataset collected for the purposes of music/speech discrimination. The dataset consists of 120 tracks, each 30 seconds long. Each class (music/speech) has 60 examples. The tracks are all 22050Hz Mono 16-bit audio files in .wav format.
Domain: Real Estate Difficulty: Easy to Medium Challenges: 1. Missing value treatment 2. Outlier treatment 3. Understanding which variables drive the price of homes in Boston Summary: The Boston housing dataset contains 506 observations and 14 variables. The dataset contains missing values.
Context This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. Content The dataset consists of several medical predictor variables and one target variable, `Outcome`. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. Acknowledgements Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). [Using the ADAP learning algorithm to forecast the onset of diabetes mellitus][1]. *In Proceedings of the Symposium on Computer Applications and Medical Care* (pp. 261--265). IEEE Computer Society Press. Inspiration Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes? [1]: http://rexa.info/paper/04587c10a7c92baa01948f71f2513d5928fe8e81
Context Simple convolutional neural network (CNN) models work remarkably well at classifying the MNIST handwritten digits or differentiating dogs from cats, even with a small dataset of a few thousand images. It would be a fun project to test how well the same simple techniques work when classifying famous people from their cartoon images or caricatures. Content The training dataset consists of cartoon images of six famous personalities (Abraham Lincoln, Albert Einstein, Barack Obama, Donald Trump, Mahatma Gandhi and Steve Jobs) downloaded from Google Image Search. The training dataset consists of 4942 images, the validation dataset of 2060 images, and the test dataset of 856 images. The images are in jpg format.
Content The dataset contains noise audio files corresponding to different environments. Acknowledgements This is a mirror of the database introduced by the authors of [DEMAND][1]. [1]: https://hal.inria.fr/hal-00796707/file/thiemann_demand.pdf
This is the raw dataset that generated the figures in the paper entitled "Characterization of the hemodynamic response function in white matter tracts for event-related fMRI."
A replicate dataset based on the A3130 sample set. The DNA isolates were re-processed in a different lab than the original samples using a different machine (ABI 3500 Genetic Analyzer). The RData object has two attributes: data - holds the count data of OTU abundances (rows are samples and columns are OTUs); labels - a corresponding label for each sample in data.
Supplementary Table 1. Primers used for quantitative real time PCR. Supplementary Table 2. Overlapped DEGs among five microarray datasets.
Many maps of open water and wetland have been developed based on three main methods: (i) compiling national/regional wetland surveys; (ii) identifying inundated areas by satellite imagery; (iii) delineating wetlands as shallow water table areas based on groundwater modelling. The resulting global wetland extents, however, vary from 3 to 21% of the land surface area, because of inconsistencies in wetland definitions and limitations in observation or modelling systems. To reconcile these differences, we propose composite wetland (CW) maps combining two classes of wetlands: (1) regularly flooded wetlands (RFW) which are obtained by overlapping selected open-water and inundation datasets; (2) groundwater-driven wetlands (GDW) derived from groundwater modelling (either direct or simplified using several variants of the topographic index). Wetlands are thus statically defined as areas with persistent near saturated soil because of regular flooding or shallow groundwater. To explore the uncertainty of the proposed data fusion, seven CW maps were generated at the 15 arc-sec resolution (ca 500 m at the Equator) using geographic information system (GIS) tools, by combining one RFW and different GDW maps. They correspond to contemporary potential wetlands, i.e. the expected wetlands assuming no human influence under the present climate. To validate the approach, these CW maps were compared to existing wetland datasets at the global and regional scales: the spatial patterns are decently captured, but the wetland extents are difficult to assess against the dispersion of the validation datasets. Compared to the only regional dataset encompassing both GDWs and RFWs, over France, the CW maps perform well and better than all other considered global wetland datasets. Two CW maps, showing the best overall match with the available evaluation datasets, are eventually selected. They give a global wetland extent of 27.5 and 29 million km², i.e. 
21.1 and 21.6% of global land area, which is among the highest values in the literature, in line with recent estimates also recognizing the contribution of GDWs. This wetland class covers 15% of global land area, against 9.7% for RFWs (with an overlap ca 3.4 %), including wetlands under canopy/cloud cover leading to high wetland densities in the tropics, and small scattered wetlands, which cover less than 5% of land but are very important for hydrological and ecological functioning in temperate to arid areas. By distinguishing the RFWs and GDWs globally based on uniform principles, the proposed dataset is believed to be useful for large-scale land surface modelling (hydrological, ecological and biogeochemical modelling) and environmental planning.
IMDB dataset for 5000 movies
The dataset contains a gridded global reconstruction of monthly runoff timeseries. In-situ streamflow observations from the GSIM dataset are used to train a machine learning algorithm that predicts monthly runoff rates based on antecedent precipitation and temperature from the Global Soil Wetness Project Phase 3 (GSWP3) meteorological forcing dataset. We thank Prof. Dr. Hyungjun Kim for developing the GSWP3 dataset and providing us with early access to the data. The data are provided in NetCDFv4 format at monthly resolution covering the period 1902-2014. The GRUN reconstruction ("GRUN_v1_GSWP3_WGS84_05_1902_2014.nc" file) is provided on a 0.5 degrees (WGS84) grid in units of mm/day. The runoff time series correspond to the ensemble mean of 50 reconstructions obtained by training the machine learning model with different subsets of data. The individual ensemble members of the reconstruction are provided in the "Realizations_GRUN_v1_GSWP3_WGS84_05_1902_2014.zip" file. When using this dataset, please cite: Ghiggi, G., Humphrey, V., Seneviratne, S. I., Gudmundsson (2019), GRUN: An observations-based global gridded runoff dataset from 1902 to 2014, Earth Syst. Sci. Data, 2019, DOI: https://doi.org/10.5194/essd-2019-32 The complete collection of in-situ streamflow observations from the GSIM archive can be found at: - https://doi.pangaea.de/10.1594/PANGAEA.887477 - https://doi.pangaea.de/10.1594/PANGAEA.887470 For further information on the GSIM dataset see: - https://doi.org/10.5194/essd-10-765-2018 - https://doi.org/10.5194/essd-10-787-2018 For further information on GSWP3, see: - https://doi.org/10.20783/DIAS.501 - https://hyungjun.github.io/GSWP3.DataDescription - http://hydro.iis.u-tokyo.ac.jp/GSWP3/exp1.html
**NOTE**: we're having some trouble uploading the actual images of the handwritten names. Stay tuned. This dataset contains links to images of handwritten names along with human contributors’ transcription of these written names. Over 125,000 examples of first or last names. Most names are French, making this dataset of particular interest for work on dealing with accent marks in handwritten character recognition. Acknowledgments Data was provided by the [Data For Everyone Library](https://www.crowdflower.com/data-for-everyone/) on [Crowdflower](https://www.crowdflower.com). Our Data for Everyone library is a collection of our favorite open data jobs that have come through our platform. They're available free of charge for the community, forever. The Data A file ```handwritten_names.csv``` that contains the following fields: - **_unit_id**: a unique id for the image - **image_url**: the path to the image; begins with "images/" - **transcription**: the (typed) name - **first_or_last**: whether it's a first name or a last name A folder ```images``` that contains each of the image files.
Raw Data Files **This data set contains Bitcoin data for years 2009-2011. For years 2011-2018 (~45GB), please see https://github.com/cakcora/CoinWorks/blob/master/data.MD** We provide input and output edges of transactions. This data is divided into yearly and monthly files. Each year's data is zipped together and contains 12 input edge files and 12 output edge files of transactions that were mined in the blocks of that year/month. Each line in the input edge file is tab separated with the format: ``` Unix time of transaction\thash of transaction\thash of first input transaction\tindex of output from first input transaction\thash of second input transaction\tindex of output from second input transaction\t(additional inputs, if exist)\r\n ``` Each line in the output edge file is tab separated with the format: ``` Unix time of transaction\thash of transaction\thash of first output address\tamount of first output bitcoins\thash of second output address\tamount of second output bitcoins\t(additional outputs, if exist)\r\n ``` ![Bitcoin Graph][1] Consider the Bitcoin graph in the figure above, where transactions and addresses are shown with rectangles and circles, respectively. This graph would be given in two files: inputsYear_Month.txt and outputsYear_Month.txt. 
Files would include these lines: -- inputsYear_Month.txt ``` UnixTimeOft_1 HashOft_1 HashOft_x1 0 HashOft_x2 8 UnixTimeOft_2 HashOft_1 HashOft_x3 1 HashOft_x4 3 HashOft_x5 0 UnixTimeOft_3 HashOft_1 1 UnixTimeOft_4 HashOft_3 2 HashOft_2 0 ``` -- outputsYear_Month.txt ``` UnixTimeOft_1 HashOft_1 HashOfa_6 10^8 HashOfa_7 0.8^0.8 UnixTimeOft_2 HashOft_2 HashOfa_8 3.8*10^8 UnixTimeOft_3 HashOft_3 HashOfa_9 0.2*10^8 HashOfa_10 0.2*10^8 HashOfa_11 0.3*10^8 UnixTimeOft_4 HashOft_4 HashOfa_12 3.7*10^8 HashOfa_13 0.3*10^8 ``` <a href="https://utdallas.box.com/s/73i8q4g59ceoum9scc4kkbhi4ritmueg">2009 data (0.1MB)</a> <a href="https://utdallas.box.com/s/6g2li4ls8zk2wfnf3tsl3gsr713r4pms">2010 data (15MB)</a> <a href="https://utdallas.box.com/s/bu30643q4l0a79b4907c2a51tx31s16a">2011 data (300MB)</a> <a href="https://utdallas.box.com/s/vb60kxanb2yifq2yaozviu6nsojnzm1c">2012 data (1.2GB)</a> <a href="https://utdallas.box.com/s/t2w1dc4xbds377lfgxulzk44sr6fwj1t">2013 data (3.2GB)</a> <a href="https://utdallas.box.com/s/xrh9bw8ctmy0kuvx24h6b8v53tdwi127">2014 data (5.2GB)</a> <a href="https://utdallas.box.com/s/zl1n1wh1dqgcicj59qvd8cmas2iz936y">2015 data (9.6GB)</a> <a href="https://utdallas.box.com/s/vuog5rneci364h4m6w5f8eursk2ym786">2016 inputs (8.1GB)</a> <a href="https://utdallas.box.com/s/9wozbdip3yjkfxgnqkf6x3jww9v5rm3m">2016 outputs (8.5GB)</a> <a href="https://utdallas.box.com/s/atscqz8cle50rc5abvbc4ct20qdqoyhi">2017 data until August (13.2GB)</a> Please [visit the full dataset page][2] for your data related questions. [1]: https://user-images.githubusercontent.com/6596905/38154759-80cbf57a-3439-11e8-8d84-9706e5825d5c.png [2]: https://github.com/cakcora/CoinWorks/blob/master/data.MD
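The tab-separated edge format described in this record could be read with a small parser like the one below. This is a sketch under assumptions: the sample lines and hash names are made up, and amounts are treated as integers (for example satoshi counts), which the record does not state explicitly.

```python
def parse_edge_line(line, value_type=int):
    """Split one edge line into (unix_time, tx_hash, [(hash, value), ...]).

    Works for both file types: input edges carry (previous tx hash, output
    index) pairs, output edges carry (address hash, amount) pairs.
    """
    fields = line.rstrip("\r\n").split("\t")
    unix_time, tx_hash, rest = int(fields[0]), fields[1], fields[2:]
    if len(rest) % 2 != 0:
        raise ValueError("expected (hash, value) pairs after the transaction hash")
    pairs = [(rest[i], value_type(rest[i + 1])) for i in range(0, len(rest), 2)]
    return unix_time, tx_hash, pairs


# Made-up sample lines that follow the documented layout.
input_line = "1293840000\ttx_a\ttx_x1\t0\ttx_x2\t8\r\n"
output_line = "1293840000\ttx_a\taddr_6\t100000000\taddr_7\t80000000\r\n"

parsed_input = parse_edge_line(input_line)
parsed_output = parse_edge_line(output_line)
total_out = sum(amount for _, amount in parsed_output[2])
```

Joining the two files on the transaction hash would then give, for each transaction, its spent outputs and the addresses it pays.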
Context Wikipedia, the world's largest encyclopedia, is a crowdsourced open knowledge project and website with millions of individual web pages. This dataset is a grab of the title of every article on Wikipedia as of September 20, 2017. Content This dataset is a simple newline (`\n`) delimited list of article titles. No distinction is made between redirects (like `Schwarzenegger`) and actual article pages (like `Arnold Schwarzenegger`). Acknowledgements This dataset was created by scraping [Special:AllPages](https://en.wikipedia.org/w/index.php?title=Special:AllPages) on Wikipedia. It was originally shared [here](https://www.reddit.com/r/datasets/comments/71954f/a_list_of_all_14mil_english_wikipedia_article/). Inspiration * What are common article title tokens? How do they compare against frequent words in the English language? * What is the longest article title? The shortest? * What countries are most popular within article titles?
List of all files: Readme file: 00_readme.txt. Monthly grids - ensemble means: 01_monthly_grids_ensemble_means_allmodels.zip. Monthly grids - ensembles, model 1 to 6: 02_monthly_grids_ensemble_JPL_MSWEP_1979_2016.zip, 02_monthly_grids_ensemble_JPL_GSWP3_1979_2014.zip, 02_monthly_grids_ensemble_JPL_ERA5_1979_2018.zip, 02_monthly_grids_ensemble_GSFC_MSWEP_1979_2016.zip, 02_monthly_grids_ensemble_GSFC_GSWP3_1979_2014.zip, 02_monthly_grids_ensemble_GSFC_ERA5_1979_2018.zip. Daily grids - ensemble means, model 1 to 6: 03_daily_grids_ensemble_means_JPL_MSWEP_1979_2016.zip, 03_daily_grids_ensemble_means_JPL_GSWP3_1979_2014.zip, 03_daily_grids_ensemble_means_JPL_ERA5_1979_2018.zip, 03_daily_grids_ensemble_means_GSFC_MSWEP_1979_2016.zip, 03_daily_grids_ensemble_means_GSFC_GSWP3_1979_2014.zip, 03_daily_grids_ensemble_means_GSFC_ERA5_1979_2018.zip. Global averages - daily and monthly time series: 04_global_averages_allmodels.zip. Content of readme: GRACE TWS Reconstruction (GRACE_REC_v03). The dataset contains reconstructed time series of daily and monthly anomalies of terrestrial water storage (TWS) based on two different GRACE solutions and three different meteorological forcing datasets. There is a total of 6 different models: JPL_MSWEP - trained with GRACE JPL mascons, forced with MSWEP forcing (1979-2016); JPL_GSWP3 - trained with GRACE JPL mascons, forced with GSWP3 forcing (1901-2014); JPL_ERA5 - trained with GRACE JPL mascons, forced with ERA5 forcing (1979-present); GSFC_MSWEP - trained with GRACE GSFC mascons, forced with MSWEP forcing (1979-2016); GSFC_GSWP3 - trained with GRACE GSFC mascons, forced with GSWP3 forcing (1901-2014); GSFC_ERA5 - trained with GRACE GSFC mascons, forced with ERA5 forcing (1979-present). The reconstruction aims at reproducing the sub-decadal climate-driven variability observed in the GRACE data. Seasonal cycle and human impacts on TWS are not reconstructed. 
A GRACE-based seasonal cycle is provided for convenience and should be used with the awareness that in reality long-term changes in the shape of the seasonal cycle might potentially occur. Long-term signals (trends over a period >15 years) are removed during the model calibration procedure but are still present in the final dataset. The interpretation of the reconstructed long-term trends should be done with particular caution and the awareness that there can be some uncertainty in the reconstructed trends. For most applications, uncertainty ranges can be derived from the 100 ensemble members available for each model. The grids are stored in NetCDFv4 files in units of mm (kg m^-2). Although the data is provided on a 0.5 degrees grid, the effective spatial resolution should be considered to be 3 degrees, similar to the original resolution of the GRACE datasets. This might need to be taken into account when comparing this dataset against other sources. The global means are stored as csv files in units of Gt of water. To convert back to mm of water, use the land area values given in the reference paper below. When using this dataset, please cite: Humphrey V. & Gudmundsson L. (submitted). GRACE-REC: A reconstruction of climate-driven water storage changes over the last century. Earth System Science Data Discussions. Vincent Humphrey, May 2019, California Institute of Technology. Your feedback is always welcome: vincent.humphrey[-a-t-]caltech.edu (vincent.humphrey[-a-t-]bluewin.ch) Abstract The amount of water stored on continents is an important constraint for water mass and energy exchanges in the Earth system and exhibits large inter-annual variability at both local and continental scales. From 2002 to 2017, the satellites of the Gravity Recovery and Climate Experiment mission (GRACE) have observed changes in terrestrial water storage (TWS) with an unprecedented level of accuracy. 
In this paper, we use a statistical model trained with GRACE observations to reconstruct past climate-driven changes in TWS from historical and near real time meteorological datasets at daily and monthly scales. Unlike most hydrological models which represent water reservoirs individually (e.g. snow, soil moisture, etc.) and usually provide a single model run, the presented approach directly reconstructs total TWS changes and includes hundreds of ensemble members which can be used to quantify predictive uncertainty. We compare these data-driven TWS estimates with other independent evaluation datasets such as the sea level budget, large-scale water balance from atmospheric reanalysis and in-situ streamflow measurements. We find that the presented approach performs overall as well or better than a set of state-of-the-art global hydrological models (Water Resources Reanalysis version 2). We provide reconstructed TWS anomalies at a spatial resolution of 0.5°, at both daily and monthly scales over the period 1901 to present, based on two different GRACE products and three different meteorological forcing datasets, resulting in 6 reconstructed TWS datasets of 100 ensemble members each. Possible user groups and applications include hydrological modelling and model benchmarking, sea level budget studies, assessments of long-term changes in the frequency of droughts, the analysis of climate signals in geodetic time series and the interpretation of the data gap between the GRACE and the GRACE Follow-On mission. Check reference for additional details and caveats. Reference: Humphrey V. & Gudmundsson L. (submitted). GRACE-REC: A reconstruction of climate-driven water storage changes over the last century. Earth System Science Data Discussions.
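The conversion from the global means in Gt back to mm of water, mentioned in the readme, follows from the fact that 1 kg of water spread over 1 m^2 forms a 1 mm layer. A minimal sketch is given below; the area constant is a hypothetical placeholder, and the actual land area values should be taken from the reference paper.

```python
KG_PER_GT = 1.0e12  # 1 gigatonne = 1e12 kg


def gt_to_mm(mass_gt, area_m2):
    """Convert a water mass in Gt to an equivalent water depth in mm.

    1 kg of water over 1 m^2 is a 1 mm layer, so kg m^-2 equals mm.
    """
    return mass_gt * KG_PER_GT / area_m2


# Hypothetical placeholder area; replace with the value from the reference paper.
PLACEHOLDER_AREA_M2 = 1.0e14
anomaly_mm = gt_to_mm(500.0, PLACEHOLDER_AREA_M2)
```

With this placeholder area, a 500 Gt anomaly corresponds to a 5 mm layer; the real figure depends on the land area actually used for the global averages.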
Datasets for Markov motif analysis, in inhibitory and excitatory clustered er networks, er networks, and increased weight networks.
SPEECH-COCO is an augmentation of MS-COCO dataset where speech is added to image and text. Speech captions were generated using text-to-speech (TTS) synthesis resulting in 616,767 spoken captions (>600h) paired with images. Disfluencies and speed perturbation were added to the signal in order to sound more natural. Each speech signal (WAV) is paired with a JSON file containing exact timecode for each word/syllable/phoneme in the spoken caption. Such a corpus could be used for Language and Vision (LaVi) tasks including speech input or output instead of text.
GIS dataset of Chiang Saen city, including space syntax integration values for streets and point data for housing.
This dataset contains key characteristics about the data described in the Data Descriptor De novo transcriptome assembly and analysis of the freshwater araphid diatom Fragilaria radians, Lake Baikal. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format 3. machine readable metadata file in ISA-Tab format (zipped folder)
This is a full archive of metadata about papers on arxiv.org from 1993-2018, including abstracts. Data is tidy and packed in TSV files, in two different collections of the total dataset: per year (all categories) and per primary category (all years). This archive also includes Jupyter notebooks for unpacking and analyzing it in python. See the README.md file and https://github.com/staeiou/arxiv_archive for more information.
An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, etc. This is a widely cited KNN dataset. I encountered it during my course, and I wish to share it here because it is a good starter example for data pre-processing and machine learning practice. **Fields** The dataset contains 16 columns. Target field: Income -- the income is divided into two classes: <=50K and >50K. Number of attributes: 14 -- these are the demographics and other features that describe a person. We can explore the possibility of predicting income level based on the individual’s personal information. **Acknowledgements** This dataset, named “adult”, is found in the UCI machine learning repository [http://www.cs.toronto.edu/~delve/data/adult/desc.html][1] The detailed description of the dataset can be found in the original UCI documentation [http://www.cs.toronto.edu/~delve/data/adult/adultDetail.html][2] [1]: http://www.cs.toronto.edu/~delve/data/adult/desc.html [2]: http://www.cs.toronto.edu/~delve/data/adult/adultDetail.html
Context Text summarization condenses a large amount of information into a concise form by selecting important information and discarding unimportant and redundant information. With the amount of textual information available on the World Wide Web, text summarization is becoming very important. In extractive summarization, the exact sentences present in the document are used as the summary. Extractive summarization is simpler and is currently the general practice among automatic text summarization researchers. The process involves scoring sentences using some method and then using the highest-scoring sentences as the summary. Because the exact sentences of the document are used, the semantic factor can be ignored, which results in a less computationally intensive summarization procedure. This kind of summary is generally completely unsupervised and language independent too. Although this kind of summary does its job of conveying the essential information, it may not necessarily be smooth or fluent. Sometimes there is almost no connection between adjacent sentences in the summary, which makes the text less readable. Content This dataset for extractive text summarization has four hundred and seventeen political news articles from the BBC from 2004 to 2005 in the News Articles folder. For each article, five summaries are provided in the Summaries folder. The first clause of the text of each article is its title. Acknowledgements This dataset was created from a dataset used for data categorization that consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005, used in the paper by D. Greene and P. Cunningham, "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006. All rights, including copyright, in the content of the original articles are owned by the BBC. More at http://mlg.ucd.ie/datasets/bbc.html
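The score-then-select process described in this record can be illustrated with a short sketch. The scoring rule used here (mean word frequency within the document) is only one simple choice made for illustration; it is not claimed to be the method behind the summaries in this dataset.

```python
import re
from collections import Counter


def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(word for words in tokenized for word in words)
    # Score = mean document-wide frequency of the words in the sentence.
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Made-up three-sentence article for illustration.
article = (
    "The budget vote passed. The budget vote split the party. "
    "Rain fell overnight."
)
summary = summarize(article, k=2)
```

Because the selected sentences are copied verbatim and only reordered into document order, the output can read abruptly, which is the fluency limitation the record points out.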
Context Satellite imagery provides unique insights into various markets, including agriculture, defense and intelligence, energy, and finance. New commercial imagery providers, such as [Planet](https://www.planet.com/), are using constellations of small satellites to capture images of the entire Earth every day. This flood of new imagery is outgrowing the ability for organizations to manually look at each image that gets captured, and there is a need for machine learning and computer vision algorithms to help automate the analysis process. The aim of this dataset is to help address the difficult task of detecting the location of large ships in satellite images. Automating this process can be applied to many issues including monitoring port activity levels and supply chain analysis. Content The dataset consists of image chips extracted from Planet satellite imagery collected over the San Francisco Bay and San Pedro Bay areas of California. It includes 4000 80x80 RGB images labeled with either a "ship" or "no-ship" classification. Image chips were derived from PlanetScope full-frame visual scene products, which are orthorectified to a 3 meter pixel size. Provided is a zipped directory `shipsnet.zip` that contains the entire dataset as .png image chips. Each individual image filename follows a specific format: {label} __ {scene id} __ {longitude} _ {latitude}.png - **label:** Valued 1 or 0, representing the "ship" class and "no-ship" class, respectively. - **scene id:** The unique identifier of the PlanetScope visual scene the image chip was extracted from. The scene id can be used with the [Planet API](https://www.planet.com/docs/reference/data-api/) to discover and download the entire scene. - **longitude_latitude:** The longitude and latitude coordinates of the image center point, with values separated by a single underscore. The dataset is also distributed as a JSON formatted text file `shipsnet.json`. 
The loaded object contains **data**, **labels**, **scene_ids**, and **locations** lists. The pixel value data for each 80x80 RGB image is stored as a list of 19200 integers within the **data** list. The first 6400 entries contain the red channel values, the next 6400 the green, and the final 6400 the blue. The image is stored in row-major order, so that the first 80 entries of the array are the red channel values of the first row of the image. The list values at index *i* in **labels**, **scene_ids**, and **locations** each correspond to the *i*-th image in the **data** list. Class Labels The "ship" class includes 1000 images. Images in this class are near-centered on the body of a single ship. Ships of different sizes and orientations, captured under different atmospheric conditions, are included. Example images from this class are shown below. The "no-ship" class includes 3000 images. A third of these are a random sampling of different landcover features - water, vegetation, bare earth, buildings, etc. - that do not include any portion of a ship. The next third are "partial ships" that contain only a portion of a ship, but not enough to meet the full definition of the "ship" class. The last third are images that have previously been mislabeled by machine learning models, typically caused by bright pixels or strong linear features. Example images from this class are shown below. Acknowledgements Satellite imagery used to build this dataset is made available through Planet's [Open California](https://www.planet.com/products/open-california/) dataset, which is [openly licensed](https://creativecommons.org/licenses/by-sa/4.0/). As such, this dataset is also available under the same CC-BY-SA license. Users can sign up for a free Planet account to search, view, and download their imagery and gain access to their API.
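The channel-planar, row-major layout of each entry in the **data** list, and the filename convention, can be made concrete with a small sketch. The image below is synthetic and the filename is made up; the parser also assumes that real filenames contain no spaces around the underscores.

```python
SIDE = 80
PLANE = SIDE * SIDE  # 6400 values per colour channel


def pixel(flat, row, col):
    """Return the (R, G, B) triple at (row, col) of one flattened image."""
    offset = row * SIDE + col  # row-major position within a channel plane
    return flat[offset], flat[PLANE + offset], flat[2 * PLANE + offset]


def parse_chip_name(filename):
    """Split '{label}__{scene id}__{longitude}_{latitude}.png' into parts."""
    stem = filename[:-4] if filename.endswith(".png") else filename
    label, scene_id, lon_lat = stem.split("__")
    lon, lat = lon_lat.split("_")
    return int(label), scene_id, float(lon), float(lat)


# Synthetic image: red plane all 10, green all 20, blue all 30.
flat = [10] * PLANE + [20] * PLANE + [30] * PLANE
flat[1 * SIDE + 2] = 99  # overwrite the red value at row 1, column 2

rgb = pixel(flat, 1, 2)
parts = parse_chip_name("1__scene123__-122.35_37.76.png")
```

Indexing the three planes separately avoids copying the list; an array library could instead reshape the 19200 values to 3 x 80 x 80 and move the channel axis last.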
The database presented here contains radiogenic neodymium and strontium isotope ratios measured on both terrestrial and marine sediments. It was compiled to help assessing sediment provenance and transport processes for various time intervals. This can be achieved by either mapping sediment isotopic signature and/or fingerprinting source areas using statistical tools (e.g. Blanchet, 2018b, 2018a). The database has been built by incorporating data from the literature and the SedDB database and harmonizing the metadata, especially units and geographical coordinates. The original data were processed in three steps. Firstly, a specific attention has been devoted to provide geographical coordinates to each sample in order to be able to map the data. When available, the original geographical coordinates from the reference (generally DMS coordinates, with different precision standard) were transferred into the decimal degrees system. When coordinates were not provided, an approximate location was derived from available information in the original publication. Secondly, all samples were assigned a set of standardized criteria that help splitting the dataset in specific categories. We defined categories associated with the sample location ("Region", "Sub-region", "Location", which relate to location at continental to city/river scale) or with the sample types (terrestrial samples – “aerosols”, “soil sediments”, “river sediments”, “rocks” - or marine samples –“marine sediment” or “trap sample”). Thirdly, samples were discriminated according to their deposition age, which allowed to compute average values for specific time intervals (see attached table "Age_determination_Sediment_Cores_V2.txt"). A first version of the database was published in September 2018 and presented data for the African sector. A second version was published in April 2019, in which the dataset has been extended to reach a global extent. 
The dataset will be further updated bi-annually to increase the geographical resolution and/or add other type of samples. This dataset consists of two tab separated tables: "Dataset_Nd_Sr_isotopes_V2.txt" and "Age_determination_Sediment_Cores_V2.txt". "Dataset_Nd_Sr_isotopes_V2.txt" contains the assembled dataset of marine and terrestrial Nd and/or Sr concentration and isotopes, together with sorting criteria and geographical locations. "Age_determination_Sediment_Cores_V2.txt" contains all background information concerning the determination of the isotopic signature of specific time intervals (depth interval, number of samples, mean and standard deviation). Column headers are explained in respective metadata comma-separated files. A full reference list is provided in the file “References_Database_Nd_Sr_isotopes_V2.rtf”. Finally, R code for mapping the data and running statistical analyses is also available for this dataset (Blanchet, 2018b, 2018a).
A number of zipped files containing datasets and supplementary information focusing on the catalytic activity of silver nanoparticles. Supplementary information Figure 2 a) and b): Characterisation of Ag Ab NPs using extinction spectroscopy and dynamic light scattering. Dataset contains DSW files of Ag Ab NPs with different concentrations of Ab on the surface, obtained using a Cary UV-Vis instrument, and an Excel file with the extinction spectra plotted together and size information from the DLS. Figure 3 a): Catalytic activity of silver nanoparticles assessed with extinction spectroscopy. Dataset contains DSW files obtained using a Cary 300 Bio UV-Vis spectrometer with Cary software. Also contains Excel files with normalised extinction spectra for each sample and a graph of all samples plotted together. Figure 3 b): Catalytic activity of silver nanoparticles. This dataset contains SPC files for each sample taken on a 638 nm Raman spectrometer using Ocean Optics software. An Excel spreadsheet is also included, which has the average spectrum for each sample and a graph containing all averaged spectra. Figure 4: Catalytic activity of silver nanoparticles conjugated to antibodies. Dataset includes SPC files for all samples taken on a 638 nm Raman spectrometer with Ocean Optics software and an Excel file containing average spectra for each sample. The Excel file also has a graph of the average SERRS spectrum for each sample plotted together. Figure 5 a) and b): Analysis of SLISA. Dataset contains SPC files for each concentration and its replicates. SPC files were obtained using 638 nm laser excitation on an InVia Renishaw instrument. Dataset also contains an Excel file with averages for each concentration, average spectra plotted against each other and a limit of detection graph.
Context Analyzing research trends is an interesting and important task, but many conferences do not publish their accepted-paper lists in a convenient format. For that reason, I am sharing a dataset of recent [ACL](https://en.wikipedia.org/wiki/Association_for_Computational_Linguistics) accepted papers! Content This dataset includes ACL accepted papers (long & short) from 2016 to 2018. * [ACL 2016 accepted papers](http://mirror.aclweb.org/acl2016/indexa779.html?article_id=68) * [ACL 2017 accepted papers](https://acl2017.wordpress.com/2017/04/05/accepted-papers-and-demonstrations/) * [ACL 2018 accepted papers](https://acl2018.org/programme/papers/) If an [arXiv](https://arxiv.org/) version of a paper exists, its summary and URL are included as well. The source code used to build the dataset is shared on [GitHub](https://github.com/icoxfog417/get_acl_papers).
Microsatellite dataset used in an ABC-RF study to retrace the invasion route of the Asian Longhorned Beetle. Specimens are sorted by population.
Decision trees are characterized by fast induction time and comprehensible classification rules. However, their classification accuracies are relatively lower in comparison to other black-box classifiers such as support vector machines. It is often possible to improve decision tree accuracies by combining them via boosting or bagging to form an ensemble of trees (i.e., forests). Unfortunately, ensemble approaches will cause the decision trees to lose their comprehensibility and significantly lengthen their induction time. The invention of the alternating decision tree (ADTree) allows the incorporation of boosting within a single decision tree to retain comprehensibility. However, the existing ADTree is univariate in nature which limits its potential to further improve the accuracy and induction time. This thesis presents the multivariate alternating decision tree, whereby multivariate decision nodes are incorporated into the ADTree learning algorithm. It can be considered as a generalization of the existing univariate ADTree. Three different variants of multivariate ADTrees are presented in this thesis, namely the Fisher’s ADTree, Sparse ADTree and regularized LogitBoost ADTree (rLADTree). These algorithms were benchmarked against other existing univariate, multivariate and ensemble-based decision trees using real-world datasets from the University of California, Irvine Machine Learning Repository and University of Eastern Finland Spectral Database. It is shown that the Fisher’s ADTree is capable of improving the accuracy of multivariate decision trees through boosting. At the same time it remains to be significantly smaller than boosted multivariate decision trees. It is also shown that the Sparse ADTree is a non-parametric extension of the Sparse Linear Discriminant Analysis (SLDA). It is therefore able to linearly partition the data when it is beneficial to do so, or to grow a tree to improve the classification accuracy when necessary. 
The most notable multivariate ADTree variant is the regularized LADTree, which is characterized by having no statistically significant differences in all performance metrics and offering comprehensibility when compared with the univariate, unboosted decision trees like C4.5 and CART for general datasets. For datasets with highly correlated features, the regularized LADTree outperforms the existing decision trees in terms of accuracy and comprehensibility, making it a top choice among decision tree classifiers.
Context If you like to eat cereal, do yourself a favor and avoid this dataset at all costs. After seeing these data it will never be the same for me to eat Fruity Pebbles again. Content Fields in the dataset: - Name: Name of cereal - mfr: Manufacturer of cereal - A = American Home Food Products; - G = General Mills - K = Kelloggs - N = Nabisco - P = Post - Q = Quaker Oats - R = Ralston Purina - type: - cold - hot - calories: calories per serving - protein: grams of protein - fat: grams of fat - sodium: milligrams of sodium - fiber: grams of dietary fiber - carbo: grams of complex carbohydrates - sugars: grams of sugars - potass: milligrams of potassium - vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended - shelf: display shelf (1, 2, or 3, counting from the floor) - weight: weight in ounces of one serving - cups: number of cups in one serving - rating: a rating of the cereals (Possibly from Consumer Reports?) Acknowledgements These datasets have been gathered and cleaned up by Petra Isenberg, Pierre Dragicevic and Yvonne Jansen. The original source can be found [here][1] This dataset has been converted to CSV Inspiration Eat too much sugary cereal? Ruin your appetite with this dataset! [1]: https://perso.telecom-paristech.fr/eagan/class/igr204/datasets
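The manufacturer codes listed above can be expanded while reading the CSV. This is a minimal sketch, not the dataset authors' code: the header names `name`, `mfr` and `sugars` are assumed to match the field list, and the two sample rows are invented placeholders rather than real records.

```python
import csv
import io

# Code-to-name mapping copied from the field list in the description.
MANUFACTURERS = {
    "A": "American Home Food Products",
    "G": "General Mills",
    "K": "Kelloggs",
    "N": "Nabisco",
    "P": "Post",
    "Q": "Quaker Oats",
    "R": "Ralston Purina",
}

# Hypothetical stand-in for the real file; values are placeholders.
sample_csv = io.StringIO(
    "name,mfr,sugars\n"
    "Example Flakes,K,3\n"
    "Example Puffs,G,12\n"
)

rows = []
for row in csv.DictReader(sample_csv):
    # Expand the one-letter code; fall back to "unknown" for unexpected codes.
    row["manufacturer"] = MANUFACTURERS.get(row["mfr"], "unknown")
    # csv yields strings, so numeric fields need an explicit conversion.
    row["sugars"] = int(row["sugars"])
    rows.append(row)

for row in rows:
    print(row["name"], row["manufacturer"], row["sugars"])
```

To use it on the actual file, replace `sample_csv` with an open file handle and check the header spelling first.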
A simple data set to get started with data analysis using python pandas
This zipped file contains all input files for simulations, all input files for the bat dataset, and output files from the analyses of simulated data and the bat data
Newick tree file comprising 449 angiosperm taxa found in dataset. Tip labels are codes, identifiable to species by cross-referencing with species table.
Files related to the ML analysis of the complete dataset.
Dataset for: Micro and macroscale drivers of nutrient concentrations in streams in South, Central and North America PONE-D-16-20557
Dataset
Each column of this dataset represents the histogram of each image of the image dataset.
dataset for nipype tutorials at http://nipy.org/nipype/users/pipeline_tutorial.html
NFL-Statistics-Scrape Here are the basic statistics, career statistics and game logs provided by the NFL on their website (http://www.nfl.com) for all players past and present. Summary The data was scraped using a Python code. The code can be located at Github: https://github.com/kendallgillies/NFL-Statistics-Scrape Explanation of Data 1. The first main group of statistics is the basic statistics provided for each player. This data is stored in the CSV file titled Basic_Stats.csv along with the player’s name and URL identifier. If available the data pulled for each player is as follows: 1. Number 2. Position 3. Current Team 4. Height 5. Weight 6. Age 7. Birthday 8. Birth Place 9. College Attended 10. High School Attended 11. High School Location 12. Experience 2. The second main group of statistics gathered for each player are their career statistics. While each player has a main position they play, they will have statistics in other areas; therefore, the career statistics are divided into statistics types. The statistics are then stored in CSV files based on statistic type along with the player name, URL identifier and position (if available). The following are the career statistics types and accompanying CSV file names: 1. Defensive Statistics – Career_Stats_Defensive.csv 2. Field Goal Kickers - Career_Stats_Field_Goal_Kickers.csv 3. Fumbles - Career_Stats_Fumbles.csv 4. Kick Return - Career_Stats_Kick_Return.csv 5. Kickoff - Career_Stats_Kickoff.csv 6. Offensive Line - Career_Stats_Offensive_Line.csv 7. Passing - Career_Stats_Passing.csv 8. Punt Return - Career_Stats_Punt_Return.csv 9. Punting - Career_Stats_Punting.csv 10. Receiving - Career_Stats_Receiving.csv 11. Rushing - Career_Stats_Rushing.csv 3. The final group of statistics is the game logs for each player. The game logs are stored by position and have the player name, URL identifier and position (if available). The following are the game log types and accompanying CSV file names: 1. 
Quarterback – Game_Logs_Quarterback.csv 2. Running back – Game_Logs_Runningback.csv 3. Wide Receiver and Tight End – Game_Logs_Wide_Receiver_and_Tight_End.csv 4. Offensive Line – Game_Logs_Offensive_Line.csv 5. Defensive Lineman – Game_Logs_Defensive_Lineman.csv 6. Kickers – Game_Logs_Kickers.csv 7. Punters – Game_Logs_Punters.csv Glossary While most of the abbreviations used by the NFL have been translated in the table headers in the data files, there are still a couple of abbreviations used. * FG: Field Goal * TD: Touchdown * Int: Interception
MATLAB dataset. Full matrix capture data from the ultrasonic phased array inspection of welded 316L stainless steel plates containing a 6mm lack-of-fusion flaw at a 50 degree angle with respect to the x-axis. Further experimental details can be found in the attached metadata file. This data was collected under an RCNDE targeted project which involved collaboration between the Mathematics and Statistics Department at Strathclyde, the Centre of Ultrasonic Engineering at Strathclyde and 5 industrial partners (AMEC, NNL, Rolls Royce, Shell and Weidlinger). The dataset here will be made publicly accessible under EPSRC regulations.
The RDF triples below define a vocabulary for describing processes run on linked data datasets. This vocabulary was originally intended as administrative and provenance data for RDF datasets produced at the University of Washington Libraries.
As described on the original website: There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). The image is quantized to 256 grey levels and stored as unsigned 8-bit integers; the loader will convert these to floating point values on the interval [0, 1], which are easier to work with for many algorithms. The “target” for this database is an integer from 0 to 39 indicating the identity of the person pictured; however, with only 10 examples per class, this relatively small dataset is more interesting from an unsupervised or semi-supervised perspective. The original dataset consisted of 92 x 112, while the version available here consists of 64x64 images. Credit to AT&T Laboratories Cambridge for images
This dataset contains images of ten different knots tied with two types of climbing rope. The ten knots are: - The Alpine Butterfly Knot - The Bowline Knot - The Clove Hitch - The Figure-8 Knot - The Figure-8 Loop - The Fisherman's Knot - The Flemish Bend - The Overhand Knot - The Reef Knot - The Slip Knot Every knot was photographed in many different conditions. Each knot was photographed at four different z-axis rotations. Each knot was photographed in three different lighting conditions. Each knot was photographed at three different tensions. Each knot was photographed with two different backgrounds. Capturing each knot in these different conditions resulted in 144 images per knot and 1440 images in total for the entire 10Knots dataset. This dataset was originally created to train a convolutional neural network implemented in Keras to perform image classification and classify these ten different knots.
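The per-knot and total image counts follow from crossing all capture conditions. The sketch below reproduces the arithmetic, assuming the two rope types are fully crossed with the other four conditions; the integer labels are placeholders, not names used in the dataset.

```python
from itertools import product

# Condition counts taken from the description; the labels are placeholders.
ropes = range(2)        # two types of climbing rope
rotations = range(4)    # four z-axis rotations
lightings = range(3)    # three lighting conditions
tensions = range(3)     # three tensions
backgrounds = range(2)  # two backgrounds

# Every combination of conditions corresponds to one photograph of a knot.
conditions = list(product(ropes, rotations, lightings, tensions, backgrounds))
images_per_knot = len(conditions)
total_images = images_per_knot * 10  # ten knot classes

print(images_per_knot, total_images)  # 144 1440
```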
Context This dataset contains a lot of historical sales data. It was extracted from a top Brazilian retailer and has many SKUs and many stores. The data was transformed to protect the identity of the retailer. Content [TBD] Acknowledgements This data would not be available without the full collaboration of our customers, who understand that sharing their core and strategic information has more advantages than possible hazards. They also support our continuous development of innovative ML systems across their value chain. Inspiration Every retail business in the world faces a fundamental question: how much inventory should I carry? On the one hand, too much inventory means working capital costs, operational costs and a complex operation. On the other hand, lack of inventory leads to lost sales, unhappy customers and a damaged brand. Current inventory management models offer many solutions for placing the correct order, but they all depend on a single unknown factor: the demand for the next periods. This is why short-term forecasting is so important in the retail and consumer goods industry. We encourage you to seek the best demand forecasting model for the next 2-3 weeks. This valuable insight can help many supply chain practitioners to correctly manage their inventory levels.
Dataset connected with article: 'Increasing the use of conceptually-derived strategies in arithmetic: using inversion problems to promote the use of associativity shortcuts.' Abstract: Conceptual knowledge of key principles underlying arithmetic is an important precursor to understanding algebra and later success in mathematics. One such principle is associativity, which allows individuals to solve problems in different ways by decomposing and recombining subexpressions (e.g. ‘a + b – c’ = ‘b – c + a’). More than any other principle, children and adults alike have difficulty understanding it, and educators have called for this to change. We report three intervention studies that were conducted in university classrooms to investigate whether adults’ use of associativity could be improved. In all three studies, it was found that those who first solved inversion problems (e.g. ‘a + b – b’) were more likely than controls to then use associativity on ‘a + b – c’ problems. We suggest that ‘a + b – b’ inversion problems may either direct spatial attention to the location of ‘b – c’ on associativity problems, or implicitly communicate the validity and efficiency of a right-to-left strategy. These findings may be helpful for those designing brief activities that aim to aid the understanding of arithmetic principles and algebra.
Context There's a story behind every dataset and here's your opportunity to share yours. Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?
Supplemental Data. Walker et al. (2017). Plant Cell 10.1105/tpc.16.00961. Supplemental Dataset 8. Gene family regulation across cell types and treatments. (A) Summary of gene family regulation: number DE/non-DE, in which timeseries they are regulated and if transcripts are DE in one or both cell types. (B) Lists of the genes in each family together with cluster numbers if differentially expressed in a timeseries.
Context Things like blockchain, Bitcoin, Bitcoin Cash, Ethereum, Ripple etc. constantly come up in the news articles I read, so I wanted to understand more about them, and [this post][1] helped me get started. Once I had the basics down, the data scientist inside me started raising questions like: 1. How many cryptocurrencies are there and what are their prices and valuations? 2. Why is there a sudden surge in interest in recent days? To answer these questions (and, if possible, to predict future prices ;)), I started collecting data from [coinmarketcap][2] about the cryptocurrencies. So what next? Now that we have the price data, I wanted to dig a little more into the factors affecting the price of coins. I started off with Bitcoin, and there are quite a few parameters which affect the price of Bitcoin. Thanks to [Blockchain Info][3], I was able to get quite a few parameters on an every-other-day basis. This will help in understanding the other factors related to the Bitcoin price and also in making better predictions than with the historical price alone. Content The dataset has one csv file for each currency. Price history is available on a daily basis from April 28, 2013. This dataset has the historical price information of some of the top cryptocurrencies by market capitalization. The currencies included are: - Bitcoin - Ethereum - Ripple - Bitcoin cash - Bitconnect - Dash - Ethereum Classic - Iota - Litecoin - Monero - Nem - Neo - Numeraire - Stratis - Waves - Date : date of observation - Open : Opening price on the given day - High : Highest price on the given day - Low : Lowest price on the given day - Close : Closing price on the given day - Volume : Volume of transactions on the given day - Market Cap : Market capitalization in USD **Bitcoin Dataset (bitcoin_dataset.csv) :** This dataset has the following features. - Date : Date of observation - btc_market_price : Average USD market price across major bitcoin exchanges. 
- btc_total_bitcoins : The total number of bitcoins that have already been mined. - btc_market_cap : The total USD value of bitcoin supply in circulation. - btc_trade_volume : The total USD value of trading volume on major bitcoin exchanges. - btc_blocks_size : The total size of all block headers and transactions. - btc_avg_block_size : The average block size in MB. - btc_n_orphaned_blocks : The total number of blocks mined but ultimately not attached to the main Bitcoin blockchain. - btc_n_transactions_per_block : The average number of transactions per block. - btc_median_confirmation_time : The median time for a transaction to be accepted into a mined block. - btc_hash_rate : The estimated number of tera hashes per second the Bitcoin network is performing. - btc_difficulty : A relative measure of how difficult it is to find a new block. - btc_miners_revenue : Total value of coinbase block rewards and transaction fees paid to miners. - btc_transaction_fees : The total value of all transaction fees paid to miners. - btc_cost_per_transaction_percent : miners revenue as percentage of the transaction volume. - btc_cost_per_transaction : miners revenue divided by the number of transactions. - btc_n_unique_addresses : The total number of unique addresses used on the Bitcoin blockchain. - btc_n_transactions : The number of daily confirmed Bitcoin transactions. - btc_n_transactions_total : Total number of transactions. - btc_n_transactions_excluding_popular : The total number of Bitcoin transactions, excluding the 100 most popular addresses. - btc_n_transactions_excluding_chains_longer_than_100 : The total number of Bitcoin transactions per day excluding long transaction chains. - btc_output_volume : The total value of all transaction outputs per day. - btc_estimated_transaction_volume : The total estimated value of transactions on the Bitcoin blockchain. - btc_estimated_transaction_volume_usd : The estimated transaction value in USD value. 
**Ethereum Dataset (ethereum_dataset.csv):** This dataset has the following features - Date(UTC) : Date of transaction - UnixTimeStamp : unix timestamp - eth_etherprice : price of ethereum - eth_tx : number of transactions per day - eth_address : Cumulative address growth - eth_supply : Number of ethers in supply - eth_marketcap : Market cap in USD - eth_hashrate : hash rate in GH/s - eth_difficulty : Difficulty level in TH - eth_blocks : number of blocks per day - eth_uncles : number of uncles per day - eth_blocksize : average block size in bytes - eth_blocktime : average block time in seconds - eth_gasprice : Average gas price in Wei - eth_gaslimit : Gas limit per day - eth_gasused : total gas used per day - eth_ethersupply : new ether supply per day - eth_chaindatasize : chain data size in bytes - eth_ens_register : Ethereum Name Service (ENS) registrations per day Acknowledgements This data is taken from [coinmarketcap][5] and it is [free][6] to use the data. Bitcoin dataset is obtained from [Blockchain Info][7]. Ethereum dataset is obtained from [Etherscan][8]. Cover Image : Photo by Thomas Malama on Unsplash Inspiration Some of the questions which could be explored with this dataset are: 1. How did the historical prices / market capitalizations of various currencies change over time? 2. Predicting the future price of the currencies 3. Which currencies are more volatile and which ones are more stable? 4. How do the price fluctuations of currencies correlate with each other? 5. Seasonal trend in the price fluctuations Bitcoin / Ethereum dataset could be used to look at the following: 1. Factors affecting the bitcoin / ether price. 2. Directional prediction of bitcoin / ether price. (refer to [this paper][9] for more inspiration) 3. Actual bitcoin price prediction. 
[1]: https://www.linkedin.com/pulse/blockchain-absolute-beginners-mohit-mamoria [2]: https://coinmarketcap.com/ [3]: https://blockchain.info/ [4]: https://etherscan.io/charts [5]: https://coinmarketcap.com/ [6]: https://coinmarketcap.com/faq/ [7]: https://blockchain.info/ [8]: https://etherscan.io/charts [9]: http://cs229.stanford.edu/proj2014/Isaac%20Madan,%20Shaurya%20Saluja,%20Aojia%20Zhao,Automated%20Bitcoin%20Trading%20via%20Machine%20Learning%20Algorithms.pdf
A dataset comprising defect4j, bugs-dot-jar and the extended dataset from ye[fse'14].
Nexus file for Dataset 2, 38-taxon dataset.
Context Competitions like LUNA (http://luna16.grand-challenge.org) and the Kaggle Data Science Bowl 2017 (https://www.kaggle.com/c/data-science-bowl-2017) involve processing and trying to find lesions in CT images of the lungs. In order to find disease in these images well, it is important to first find the lungs well. This dataset is a collection of 2D and 3D images with manually segmented lungs. Challenge Come up with an algorithm for accurately segmenting lungs and measuring important clinical parameters (lung volume, PD, etc) Percentile Density (PD) The PD is the density (in Hounsfield units) the given percentile of pixels fall below in the image. The table includes 5 and 95% for reference. For smokers this value is often high indicating the build up of other things in the lungs.
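The Percentile Density definition above can be sketched as follows. This assumes a nearest-rank percentile over the Hounsfield values of pixels inside the lung mask; the table shipped with the dataset may use a different (for example interpolating) definition, and the HU values below are invented for illustration.

```python
import math


def percentile_density(hu_values, percentile):
    """Return the HU value that `percentile` percent of the values fall at or below.

    Uses the nearest-rank method, which is an assumption of this sketch.
    """
    if not hu_values:
        raise ValueError("hu_values must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(hu_values)
    # Nearest rank: smallest index covering at least `percentile` percent.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical HU values for pixels inside a segmented lung.
lung_hu = [-950, -900, -880, -860, -850, -820, -800, -750, -700, -600]
pd5 = percentile_density(lung_hu, 5)
pd95 = percentile_density(lung_hu, 95)
print(pd5, pd95)  # -950 -600
```

In practice the input would be the image values selected by the manual segmentation mask, so the quality of the lung segmentation directly affects the PD values.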
photos of furniture
High-throughput experimental data are accumulating exponentially in public databases. Unfortunately, however, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed “batch effects,” and the latter is often modeled by subtypes. Existing methods either tackle batch effects provided that subtypes are known or cluster subtypes assuming that batch effects are absent. Consequently, there is a lack of research on the correction of batch effects with the presence of unknown subtypes. Here, we combine a location-and-scale adjustment model and model-based clustering into a novel hybrid one, the batch-effects-correction-with-unknown-subtypes model (BUS). BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, (d) allowing the number of subtypes to vary from batch to batch, (e) integrating batches from different platforms, and (f) enjoying a linear-order computation complexity. We prove the identifiability of BUS and provide conditions for study designs under which batch effects can be corrected. BUS is evaluated by simulation studies and a real breast cancer dataset combined from three batches measured on two platforms. Results from the breast cancer dataset offer much better biological insights than existing methods. We implement BUS as a free Bioconductor package BUScorrect. Supplementary materials for this article are available online.
We used different sensing techniques including time-lapse imagery, electric conductivity and stage measurements to generate a combined dataset of presence and absence of streamflow within a large number of nested sub-catchments in the Attert Catchment, Luxembourg. The first sites of observation were established in 2013 and successively extended to a total number of 182 in 2016 as part of the project “Catchments As Organized Systems” (CAOS, Zehe et al., 2014). Setup for time-lapse imagery measurements was inspired by Gilmore et al. (2013) while the setup for EC-sensor was proposed by Chapin et al. (2014). Temporal resolution ranged from 5 to 15 minutes intervals. Each single dataset was carefully processed and quality controlled before the time interval was homogenized to 30 minutes. The dataset provides valuable information of the dynamics of a meso-scale stream network in space and time. The Attert basin is located in the border region of Luxembourg and Belgium and covers an area of 247 km². The elevation of the catchment ranges from 245 m a.s.l. in Useldange to 549 m a.s.l. in the Ardennes. Climate conditions across the catchment are rather similar in terms of temperature and precipitation. Hydrological regimes are mainly driven by seasonal fluctuations in evapotranspiration causing flow to cease in intermittent reaches during dry periods. The catchment covers three predominant geologies: Slate, Marls and Sandstone. The dataset features data from catchments covering all geological characteristics from single geology to mixed geology. It can be used to test and evaluate hydrologic models, but also for the assessment of the intermittent stream ecosystem in the Attert basin.
The purple rapeseed leaves dataset contains a training set and a test set, both of which include RGB images of 256 × 256 pixels cropped from a UAV orthoimage, together with the corresponding labels. The code for the U-Net model used in this experiment is also included in the folder.
Context Based on Fisher's linear discriminant model, this data set became a typical test case for many statistical classification techniques in machine learning such as support vector machines. Content The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.[1] It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.[2] Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".[3] The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. Acknowledgements Description taken from Wiki. Would like to thank Dr. Jason Brownlee who has explained all the examples very nicely and clearly!
Copyright information: Taken from "Combining gene expression data from different generations of oligonucleotide arrays". BMC Bioinformatics 2004;5():159-159. Published online 25 Oct 2004. PMCID: PMC528726. Copyright © 2004 Hwang et al; licensee BioMed Central Ltd. The same RNA was hybridized on both HG-U95Av2 and HG-U133A arrays, for 14 samples. Three methods for matching the probes were considered, but the two datasets gave highly inconsistent results in cluster analysis and identification of differentially expressed genes. To improve the comparability in general, probe-level sequence information was exploited. All 25-mer probes were aligned to human genome sequences by BLAT and then filtered based on the length of their overlap with the probes on the other array. New expression indices were calculated using only the selected probes, and this results in higher reproducibility.
Bank Marketing **Abstract:** The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y). **Data Set Information:** The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to assess if the product (bank term deposit) would be ('yes') or not ('no') subscribed. Attribute Information: Bank client data: - Age (numeric) - Job : type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown') - Marital : marital status (categorical: 'divorced', 'married', 'single', 'unknown' ; note: 'divorced' means divorced or widowed) - Education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown') - Default: has credit in default? (categorical: 'no', 'yes', 'unknown') - Housing: has housing loan? (categorical: 'no', 'yes', 'unknown') - Loan: has personal loan? (categorical: 'no', 'yes', 'unknown') Related with the last contact of the current campaign: - Contact: contact communication type (categorical: 'cellular','telephone') - Month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec') - Day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri') - Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. 
Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. Other attributes: - Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact) - Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) - Previous: number of contacts performed before this campaign and for this client (numeric) - Poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success') Social and economic context attributes - Emp.var.rate: employment variation rate - quarterly indicator (numeric) - Cons.price.idx: consumer price index - monthly indicator (numeric) - Cons.conf.idx: consumer confidence index - monthly indicator (numeric) - Euribor3m: euribor 3 month rate - daily indicator (numeric) - Nr.employed: number of employees - quarterly indicator (numeric) Output variable (desired target): - y - has the client subscribed a term deposit? (binary: 'yes', 'no') Analysis Steps: - Attribute information analysis. - Machine Learning (Logistic Regression, KNN, SVM, Decision Tree, Random Forest, Naive Bayes) - Deep Learning (ANN) Source: - Dataset from : http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
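Two of the attribute notes above translate directly into preprocessing: the call duration leaks the target and should be dropped for a realistic model, and a `pdays` value of 999 is a sentinel for "not previously contacted" rather than a real day count. A minimal sketch, assuming lowercase column names (`duration`, `pdays`) and using an invented example record:

```python
def prepare_record(record):
    """Return a copy of one client record prepared for realistic modelling."""
    cleaned = dict(record)
    # Duration is only known after the call, when y is already known.
    cleaned.pop("duration", None)
    # 999 is a sentinel meaning the client was not previously contacted.
    if cleaned.get("pdays") == 999:
        cleaned["pdays"] = None
    return cleaned


# Hypothetical record; keys and values are illustrative only.
raw = {
    "age": 41,
    "duration": 261,
    "pdays": 999,
    "previous": 0,
    "poutcome": "nonexistent",
    "y": "no",
}
prepared = prepare_record(raw)
print(prepared)
```

Leaving 999 as a number would make a model treat never-contacted clients as if they had been contacted almost three years ago, which is why the sketch replaces it with an explicit missing value.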
Normalized full metabolic dataset. Table S2. Seed weight in response to salinity for both seasons. Table S3. Putative QTLs for maturation percent. Table S4. Putative QTLs for RMC in SDF. Table S5. Putative QTLs for RMC in SDS. (XLSX 882 kb)
Context This dataset was downloaded from INEP, a department of the Brazilian Education Ministry. It contains data on the applicants for the 2016 National High School Exam. Content This dataset contains not only the exam results but also the social and economic context of the applicants. Acknowledgements The original dataset is provided by INEP (http://portal.inep.gov.br/microdados). Inspiration The objective is to explore the dataset to better understand how the applicants' social and economic context relates to their exam results.
Context MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. This data set consists of: * 100,000 ratings (1-5) from 943 users on 1682 movies. * Each user has rated at least 20 movies. * Simple demographic info for the users (age, gender, occupation, zip) The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up - users who had less than 20 ratings or did not have complete demographic information were removed from this data set. Detailed descriptions of the data file can be found at the end of this file. Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set.
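The clean-up rule described for this data set (dropping users with fewer than 20 ratings) can be sketched as below. This is an illustration of the rule, not the GroupLens procedure; the `(user_id, item_id, rating)` tuple layout and the toy ratings are assumptions.

```python
from collections import Counter


def filter_active_users(ratings, min_ratings=20):
    """Keep only ratings from users who have at least `min_ratings` ratings."""
    counts = Counter(user_id for user_id, _, _ in ratings)
    return [r for r in ratings if counts[r[0]] >= min_ratings]


# Toy data: user 1 rated 25 items, user 2 only 5.
ratings = [(1, item, 4) for item in range(25)] + [(2, item, 3) for item in range(5)]
kept = filter_active_users(ratings)
print(len(kept), sorted({user_id for user_id, _, _ in kept}))  # 25 [1]
```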
This dataset includes ANSI/IES TM-30-18 data for approximately 165,000 light source spectral power distributions.
Context Most countries of the world define poverty as a lack of money. Yet poor people themselves consider their experience of poverty much more broadly. A person who is poor can suffer from multiple disadvantages at the same time – for example they may have poor health or malnutrition, a lack of clean water or electricity, poor quality of work or little schooling. Focusing on one factor alone, such as income, is not enough to capture the true reality of poverty. Multidimensional poverty measures can be used to create a more comprehensive picture. They reveal who is poor and how they are poor – the range of different disadvantages they experience. As well as providing a headline measure of poverty, multidimensional measures can be broken down to reveal the poverty level in different areas of a country, and among different sub-groups of people. Content Most recent MPI data harmonized for comparisons across time. OPHI researchers apply the AF method and related multidimensional measures to a range of different countries and contexts. Their analyses span a number of different topics, such as changes in multidimensional poverty over time, comparisons in rural and urban poverty, and inequality among the poor. For more information on OPHI’s research, see our [working paper series](http://www.ophi.org.uk/resources/ophi-working-papers/) and [research briefings](http://www.ophi.org.uk/resources/briefing-documents/). OPHI also calculates the Global Multidimensional Poverty Index [MPI](http://www.ophi.org.uk/multidimensional-poverty-index/), which has been published since 2010 in the United Nations Development Programme’s Human Development Report. The Global MPI is an internationally-comparable measure of acute poverty covering more than 100 developing countries. It is updated by OPHI twice a year and constructed using the AF method. The Alkire Foster (AF) method is a way of measuring multidimensional poverty developed by OPHI’s Sabina Alkire and James Foster. 
Building on the Foster-Greer-Thorbecke poverty measures, it involves counting the different types of deprivation that individuals experience at the same time, such as a lack of education or employment, or poor health or living standards. These deprivation profiles are analysed to identify who is poor, and then used to construct a multidimensional index of poverty (MPI). For free online video guides on how to use the AF method, see [OPHI’s online training portal](http://www.ophi.org.uk/teaching/online-training-portal/). To identify the poor, the AF method counts the overlapping or simultaneous deprivations that a person or household experiences in different indicators of poverty. The indicators may be equally weighted or take different weights. People are identified as multidimensionally poor if the weighted sum of their deprivations is greater than or equal to a poverty cut off – such as 20%, 30% or 50% of all deprivations. It is a flexible approach which can be tailored to a variety of situations by selecting different dimensions (e.g. education), indicators of poverty within each dimension (e.g. how many years schooling a person has) and poverty cut offs (e.g. a person with fewer than five years of education is considered deprived). The most common way of measuring poverty is to calculate the percentage of the population who are poor, known as the headcount ratio (H). Having identified who is poor, the AF method generates a unique class of poverty measures (Mα) that goes beyond the simple headcount ratio. Three measures in this class are of high importance: Adjusted headcount ratio (M0), otherwise known as the MPI: This measure reflects both the incidence of poverty (the percentage of the population who are poor) and the intensity of poverty (the percentage of deprivations suffered by each person or household on average). M0 is calculated by multiplying the incidence (H) by the intensity (A). M0 = H x A. 
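The identification step and the M0 = H x A calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the equal indicator weights and the four toy households are assumptions made for the example, not OPHI code or OPHI data.

```python
def adjusted_headcount(deprivations, weights, cutoff):
    """Return (H, A, M0) for a 0/1 deprivation matrix.

    deprivations: one row per person or household; 1 = deprived in that indicator.
    weights:      indicator weights that sum to 1 (hypothetical values below).
    cutoff:       poverty cut-off as a share of weighted deprivations.
    """
    # Weighted deprivation score of each person.
    scores = [sum(w * d for w, d in zip(weights, row)) for row in deprivations]
    # A person is multidimensionally poor if the score reaches the cut-off.
    poor_scores = [s for s in scores if s >= cutoff]
    if not scores or not poor_scores:
        return 0.0, 0.0, 0.0
    incidence = len(poor_scores) / len(scores)         # H
    intensity = sum(poor_scores) / len(poor_scores)    # A
    return incidence, intensity, incidence * intensity  # M0 = H x A


# Toy example: four households, four equally weighted indicators, cut-off 50%.
toy = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
H, A, M0 = adjusted_headcount(toy, [0.25, 0.25, 0.25, 0.25], 0.5)
```

In the toy example two of the four households reach the cut-off, so H is 0.5; their average deprivation score is 0.75, which gives M0 = 0.375.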
Find out about other ways the AF method is used in [research and policy](http://www.ophi.org.uk/research/multidimensional-poverty/research-applications/). Additional data [here](http://ophi.org.uk/multidimensional-poverty-index/global-mpi-2017/mpi-data/). This dataset contains the [Summer 2016 Subnational data from Table 6.3](http://ophi.org.uk/multidimensional-poverty-index/mpi-resources/2016) as it is the most recent dataset for MPI comparisons over time. Data Cleaning Notes The original format was significantly different and unusable in many ways. I converted all survey years (`year1` and `year2`) from a `Period` format that looked something like "2005/6 - 2009". Note that "2005/6" meant the survey was conducted from sometime in 2005 through sometime in 2006. Additionally, the `year2` part could follow a similar format (e.g. "2009/10"). For simplicity, I dropped the `/%s` portion of both `year1` and `year2`. This still maintains consistency if either year column is used for a comparison statistic. The raw data file from OPHI has its `Total Population` and `Number of Poor` in thousands. I multiplied these values by 1,000 to obtain raw population numbers. For example, `3.142` becomes `3142`. The original file is in an Excel format that needed to be converted into a `csv` in order to upload to Kaggle. I decided to keep values to the `10^-6` decimal place. The statistical significance columns come from OPHI's test of significant changes. Directly from the Excel file: `Note, *** statistically significant at α=0.01, ** statistically significant at α=0.05, * statistically significant at α=0.10` Acknowledgements Alkire, S. and Robles, G. (2017). “Multidimensional Poverty Index Summer 2017: Brief methodological note and results.” OPHI Methodological Note 44, University of Oxford. Alkire, S. and Santos, M. E. (2010). “Acute multidimensional poverty: A new index for developing countries.” OPHI Working Papers 38, University of Oxford. Alkire, S., Jindra, C., Robles, G.
and Vaz, A. (2017). ‘Multidimensional Poverty Index – Summer 2017: brief methodological note and results’. OPHI MPI Methodological Notes No. 44, Oxford Poverty and Human Development Initiative, University of Oxford. [OPHI's Kaggle Page](https://www.kaggle.com/ophi/mpi) Inspiration Further evaluate OPHI's approach to comparing subnational regions for various years. Then, consider how much Kiva's microcredit impacted the subnational MPI change.
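The two cleaning steps in the Data Cleaning Notes (dropping the `/%s` part of each survey year and converting thousands to raw counts) could look roughly like the sketch below. The function names are hypothetical, and the code assumes the `Period` string always uses a plain hyphen between the two years, as in the quoted example; it is not the uploader's actual script.

```python
def split_period(period):
    """Split a Period string such as '2005/6 - 2009' into (year1, year2).

    The '/6' or '/10' suffix that marks a survey spanning two years is dropped.
    """
    parts = [part.strip() for part in period.split("-")]
    if len(parts) != 2:
        raise ValueError(f"unexpected Period format: {period!r}")
    year1, year2 = (int(part.split("/")[0]) for part in parts)
    return year1, year2


def thousands_to_count(value):
    """Convert a value expressed in thousands (e.g. '3.142') to a raw count."""
    return round(float(value) * 1000)
```

For instance, `split_period("2005/6 - 2009")` returns `(2005, 2009)` and `thousands_to_count("3.142")` returns `3142`.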
Context There's a story behind every dataset and here's your opportunity to share yours. Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?
This data was donated by researchers of the University of Wisconsin and includes the measurements from digitized images of fine-needle aspirate of a breast mass. You can find the dataset at https://github.com/dataspelunking/MLwR/blob/master/Machine%20Learning%20with%20R%20(2nd%20Ed.)/Chapter%2003/wisc_bc_data.csv. The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis and 30 are numeric-valued laboratory measurements. The diagnosis is coded as "M" to indicate malignant or "B" to indicate benign. The other 30 numeric measurements comprise the mean, standard error and worst (i.e. largest) value for 10 different characteristics of the digitized cell nuclei, which are as follows:- - Radius - Texture - Perimeter - Area - Smoothness - Compactness - Concavity - Concave Points - Symmetry - Fractal dimension
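As a rough illustration of the layout described above (one identifier, one diagnosis code, and three statistics for each of ten characteristics), the sketch below builds the list of 32 expected columns. The column names themselves are hypothetical; the header of the actual CSV file may spell them differently.

```python
# Ten characteristics of the digitized cell nuclei, as listed above.
CHARACTERISTICS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# Three statistics recorded for every characteristic.
STATISTICS = ["mean", "se", "worst"]


def expected_columns():
    """Return the 32 column names: id, diagnosis, then 30 measurements."""
    measurements = [f"{c}_{s}" for s in STATISTICS for c in CHARACTERISTICS]
    return ["id", "diagnosis"] + measurements


def decode_diagnosis(code):
    """Map the diagnosis code to a readable label ('M' or 'B' only)."""
    return {"M": "malignant", "B": "benign"}[code]
```

Comparing `expected_columns()` against the header of the downloaded file is a cheap sanity check before any modelling.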
Content Pima Indian Diabetes Data Acknowledgements Jerry Kurata
Context Image segmentation is a complicated problem that often cannot be performed in a fully automatic manner. We use this dataset to test and explore methods that make semi-automatic segmentation work better. Content 151 images with full segmentations and paint strokes (compiled by: http://www.robots.ox.ac.uk/~vgg/data/iseg/) Acknowledgements Visual Geometry Group at Oxford for compiling the data; GrabCut Dataset from Microsoft; PASCAL Dataset; Alpha Matting Dataset. Inspiration - How well do different techniques work at expanding the initial labels to a full segmentation? - Which techniques are quick enough to run in real time (the paint strokes are normally given and the user waits for feedback, so they can't be precomputed)? - Are any of these techniques easy to implement in JavaScript so they could be browser-based?
Facial keypoints detection -> improving prediction
Context Kaggle kernels have no internet connectivity, so everything you use must be a dataset. Content A pair of African elephants. Acknowledgements Taken from here: https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/5.4-visualizing-what-convnets-learn.ipynb Inspiration It's basically a toy, just for testing and learning purposes.
This dataset contains Version 2.3 of the Global Precipitation Climatology Project (GPCP) Monthly Analysis Product. The data are monthly analyses defined on a global 2.5 degree by 2.5 degree longitude/latitude grid and cover the period January 1979 to (delayed) present.
The objective of the BRFSS is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases in the adult population. Factors assessed by the BRFSS include tobacco use, health care coverage, HIV/AIDS knowledge or prevention, physical activity, and fruit and vegetable consumption. Data are collected from a random sample of adults (one per household) through a telephone survey. The Behavioral Risk Factor Surveillance System (BRFSS) is the nation's premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world. Content - Each year contains a few hundred columns. Please see one of the [annual code books][1] for complete details. - These CSV files were converted from a SAS data format using pandas; there may be some data artifacts as a result. - If you like this data, you might also enjoy [the 2011-2015 batch][2]. Please note that those years use a different format. Acknowledgements This dataset was released by the CDC. You can find the original dataset, manuals, and [additional years of data here][3]. [1]: https://www.cdc.gov/brfss/annual_data/2001/pdf/codebook_01.pdf [2]: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system [3]: https://www.cdc.gov/brfss/annual_data/annual_data.htm
The code-hosting platform GitHub is a valuable tool for scientists and students: it supports data sharing and notifies the co-workers, tutors and supervisors involved in the research of the latest updates. It lets collaborators share current results, release datasets and updates, and more. Using a standard command-line interface, registered users can push repositories to the site. The availability of both public and private repositories makes it possible to share data updates with a target audience: e.g., unpublished research work only with co-authors or supervisors, or, vice versa, successfully defended [...] Fig. 1. Fragment of the text written using LaTeX and processed by Git. The case study concerns a graduate project: an MSc work successfully written and maintained using open source GitHub service at the University of Twente, Faculty of Geo-Information Science and Earth Observation (Netherlands), entitled “Seagrass monitoring and mapping along the coasts of Greece, Crete”. This presentation reports my own experience of managing and organizing the MSc thesis project. Instead of the traditional and highly ineffective MS Word, I used the effective combination of LaTeX tools with GitHub for data [...] thesis is open to the public. However, despite the evident usefulness and prospects of GitHub, its existing users are mostly programmers and IT specialists. Therefore, there is a need in academic centers and universities to strongly popularize and increase the use of GitHub for student works.
This dataset contains key characteristics about the data described in the Data Descriptor Longitudinal dataset of human-building interactions in U.S. offices. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format
Dataset containing observations of sea turtles and the related climatic data. The data is from Portal da Biodiversidade (https://portaldabiodiversidade.icmbio.gov.br), GBIF and Bio-Oracle (http://bio-oracle.org). The data from Portal da Biodiversidade contains observations of sea turtles in Brazil, collected by researchers from all over the country. The GBIF data contains observations from various countries. The geophysical, biotic and environmental data for surface marine realms was exported from Bio-Oracle marine data layers. The columns in the file mean:
BO_calcite - Calcite (mol.m-3)
BO_chlomax - Chlorophyll (mg.m-3)
BO_chlomean - Chlorophyll (mg.m-3)
BO_chlomin - Chlorophyll (mg.m-3)
BO_chlorange - Chlorophyll (mg.m-3)
BO_cloudmax - Cloud cover (%)
BO_cloudmean - Cloud cover (%)
BO_cloudmin - Cloud cover (%)
BO_damax - Diffuse attenuation (m-1)
BO_damean - Diffuse attenuation (m-1)
BO_damin - Diffuse attenuation (m-1)
BO_dissox - Dissolved molecular oxygen (mol.m-3)
BO_nitrate - Nitrate (mol.m-3)
BO_parmax - Photosynt. Avail. Radiation (E.m-2.day-1)
BO_parmean - Photosynt. Avail. Radiation (E.m-2.day-1)
BO_ph - pH
BO_phosphate - Phosphate (mol.m-3)
BO_salinity - Salinity (PSS)
BO_silicate - Silicate (mol.m-3)
BO_sstmax - Temperature (ºC)
BO_sstmean - Temperature (ºC)
BO_sstmin - Temperature (ºC)
BO_sstrange - Temperature (ºC)
BO_bathymin - Bathymetry (m)
BO_bathymax - Bathymetry (m)
BO_bathymean - Bathymetry (m)
More information: www.bio-oracle.org
Supplemental Dataset
Bibliographic dataset compiling all Scopus records relevant to Agent-based Complex Systems (ACS) science (i.e. the fields of agent-based and individual-based modelling). This dataset was post-processed and used to create a citation graph using the Diderot R package (https://cran.r-project.org/package=Diderot).
Resolution: 512*512*273; unsigned short; 0.3 mm resolution. This dataset was obtained using the dental CBCT imaging system ZCB100 (Shenzhen ZhongKe TianYue Technology Co., Ltd.) at 110 kV and 10 mAs, with a 15.36-cm FOV.
Reference in the dataset
Reconstructed slices of a 3D tomographic dataset after ring artifacts suppression using algorithm 6, algorithm 5, and algorithm 3 of our approaches.
This dataset contains information on UK university lecture capture policies as of April 2018. The dataset was used for the paper "Employee Surveillance: The Road to Surveillance is Paved with Good Intentions", by Lilian Edwards, Laura Martin and Tristan Henderson, accepted for presentation at the Amsterdam Privacy Conference, October 2018.
ChinaCropPhen1km: A high-resolution crop phenological dataset for three staple crops in China during 2000-2015 based on LAI products. The data files are in tif format, and each file is named "crop type"+"doy"+"key phenological stages"+".tif". The crop type takes the values 1, 2, and 3, representing maize, wheat, and rice, respectively. The key phenological stage takes the values 1, 2, and 3, whose meaning differs by crop. For wheat, the stages are, in order, green-up (emergence), heading and maturity date; for maize, they are the three-leaf (V3) stage, heading and maturity date; for rice, they are the transplanting stage, heading and maturity date. The data have a spatial resolution of 1 km.
File may be viewed using ProSeq software at http://dps.plants.ox.ac.uk/sequencing/proseq.htm [Filatov DA (2009) Processing and population genetic analysis of multigenic datasets with ProSeq3 software. Bioinformatics 25: 3189-3190].
Sequence datasets used in phylogenetic analyses. Allele files produced by pyRAD are provided. See text for dataset nomenclature
prescription dataset.
Context It is a well known fact that Millennials LOVE Avocado Toast. It's also a well known fact that all Millennials live in their parents' basements. Clearly, they aren't buying homes because they are buying too much Avocado Toast! But maybe there's hope... if a Millennial could find a city with cheap avocados, they could live out the Millennial American Dream. Content This data was downloaded from the Hass Avocado Board website in May of 2018 & compiled into a single CSV. Here's how the [Hass Avocado Board describes the data on their website][1]: > The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table. Some relevant columns in the dataset: - `Date` - The date of the observation - `AveragePrice` - the average price of a single avocado - `type` - conventional or organic - `year` - the year - `Region` - the city or region of the observation - `Total Volume` - Total number of avocados sold - `4046` - Total number of avocados with PLU 4046 sold - `4225` - Total number of avocados with PLU 4225 sold - `4770` - Total number of avocados with PLU 4770 sold Acknowledgements Many thanks to the Hass Avocado Board for sharing this data!! http://www.hassavocadoboard.com/retail/volume-and-price-data Inspiration In which cities can Millennials have their avocado toast AND buy a home? Was the Avocadopocalypse of 2017 real? 
[1]: http://www.hassavocadoboard.com/retail/volume-and-price-data
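One way to start on the "cheap avocados" question is to average `AveragePrice` per `Region` for a given `type`. The sketch below uses only the standard library and the column names listed above; the function name is hypothetical, and the four sample rows are made up purely so the example runs (they are not taken from the real file).

```python
import csv
import io
from collections import defaultdict


def mean_price_by_region(csv_file, avocado_type="conventional"):
    """Average the AveragePrice column per Region for one avocado type."""
    totals = defaultdict(lambda: [0.0, 0])  # region -> [price sum, row count]
    for row in csv.DictReader(csv_file):
        if row["type"] != avocado_type:
            continue
        entry = totals[row["Region"]]
        entry[0] += float(row["AveragePrice"])
        entry[1] += 1
    return {region: total / count for region, (total, count) in totals.items()}


# Made-up rows for illustration only; region names are placeholders.
SAMPLE = """Date,AveragePrice,type,year,Region
2018-01-07,1.00,conventional,2018,CityA
2018-01-14,1.50,conventional,2018,CityA
2018-01-07,2.00,organic,2018,CityA
2018-01-07,0.75,conventional,2018,CityB
"""

prices = mean_price_by_region(io.StringIO(SAMPLE))
cheapest = min(prices, key=prices.get)
```

With the real CSV, the same function can be called on an open file handle instead of the `io.StringIO` wrapper.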
This dataset contains key characteristics about the data described in the Data Descriptor A dataset of cetacean occurrences in the Eastern North Atlantic. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format
Context Each line in LiveStreaming records a sequence of browsed items of a user in ascending order of time. Each number represents a unique itemID.
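A minimal sketch of reading this format is shown below. It assumes the item IDs on a line are separated by whitespace, which the description does not state; the function name is also hypothetical.

```python
def parse_sessions(lines):
    """Turn each non-empty line into a list of integer item IDs (time-ordered)."""
    sessions = []
    for line in lines:
        items = [int(token) for token in line.split()]
        if items:  # skip blank lines
            sessions.append(items)
    return sessions
```

Because the items are already in ascending order of time, the last element of each list is the most recently browsed item.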
Context While implementing the paper https://arxiv.org/pdf/1511.05440.pdf, we realized that the available datasets either contain image data but are very large, or contain videos from which frames have to be extracted manually. To fix that, we created this dataset. Content This dataset contains sequences of images extracted from the opening scene of the movie "The Hobbit: An Unexpected Journey". An implementation of the paper using this dataset can be found here: https://github.com/akshaybapat04/video_prediction Acknowledgements We used GOM Player to take snapshots of the video. Inspiration Can you predict the next image in an image sequence?
Copyright information: Taken from "Function-informed transcriptome analysis of renal tubule", Genome Biology 2004;5(9):R69-R69. Published online 26 Aug 2004. PMCID: PMC522876. Copyright © 2004 Wang et al.; licensee BioMed Central Ltd. Genes enriched in tubules are historically under-researched. The percentage of genes with explicit names (other than automatic CG annotations) is shown for the entire genome, and for the top 50, 100 and 200 genes (as judged by fold enrichment) from the tubule dataset.
Machine learning has emerged as a discipline that enables computers to assist humans in making sense of large and complex data sets. With the drop in the cost of sequencing technologies, large amounts of omics data are being generated and made accessible to researchers. Analysing these complex, high-volume data is not trivial, and classical tools cannot exploit their full potential. Machine learning can thus be very useful in mining large omics datasets to uncover new insights that can advance the field of medicine and improve health care. The aim of this tutorial is to introduce participants to the machine learning (ML) taxonomy and common machine learning algorithms. The tutorial will cover the methods being used to analyse different omics data sets by providing a practical context through the use of basic but widely used R and Python libraries. The tutorial will comprise a number of hands-on exercises and challenges, where the participants will acquire a first understanding of the standard ML processes as well as the practical skills in applying them to familiar problems and publicly available real-world data sets.
Dataset S1. Assembly of S1 and S2 specimens separated into bins.
Raw data for the Hay Lake pollen dataset obtained from the Neotoma Paleoecological Database.
The objective of this work was to pre-process the Soil Landscapes of Canada (SLC) database to offer a country-level soils dataset in a format ready to be used in SWAT simulations. A two-level screening process was used to identify critical information required by SWAT and to remove records with information that could not be calculated or estimated. Out of the 14,063 unique soils in the SLC, 11,838 soils with complete information were included in the dataset presented here. Soils with missing records for the required SWAT variables were removed from the analysis. These soils were compiled into a soils list provided as a reference ("incomplete" dataset).
The ultimate Soccer database for data analysis and machine learning ------------------------------------------------------------------- **What you get:** - +25,000 matches - +10,000 players - 11 European Countries with their lead championship - Seasons 2008 to 2016 - Players and Teams' attributes* sourced from EA Sports' FIFA video game series, including the weekly updates - Team line up with squad formation (X, Y coordinates) - Betting odds from up to 10 providers - Detailed match events (goal types, possession, corner, cross, fouls, cards etc...) for +10,000 matches **16th Oct 2016: New table containing teams' attributes from FIFA !* ---------- **Original Data Source:** You can easily find data about soccer matches but they are usually scattered across different websites. A thorough data collection and processing has been done to make your life easier. **I must insist that you do not make any commercial use of the data**. The data was sourced from: - [http://football-data.mx-api.enetscores.com/][1] : scores, lineup, team formation and events - [http://www.football-data.co.uk/][2] : betting odds. [Click here to understand the column naming system for betting odds:][3] - [http://sofifa.com/][4] : players and teams attributes from EA Sports FIFA games. *FIFA series and all FIFA assets property of EA Sports.* > When you have a look at the database, you will notice foreign keys for > players and matches are the same as the original data sources. I have > called those foreign keys "api_id". ---------- **Improving the dataset:** You will notice that some players are missing from the lineup (NULL values). This is because I have not been able to source their attributes from FIFA. This will be fixed over time as the crawling algorithm is being improved. The dataset will also be expanded to include international games, national cups, Champions League and Europa League. Please ask me if you're after a specific tournament. 
> Please get in touch with me if you want to help improve this dataset. [CLICK HERE TO ACCESS THE PROJECT GITHUB][5] *Important note for people interested in using the crawlers:* since I first wrote the crawling scripts (in python), it appears sofifa.com has changed its design and with it come new requirements for the scripts. The existing script to crawl players ('Player Spider') will not work until I've updated it. ---------- Exploring the data: Now that's the fun part, there is a lot you can do with this dataset. I will be adding visuals and insights to this overview page but please have a look at the kernels and give it a try yourself ! Here are some ideas for you: **The Holy Grail...** ... is obviously to predict the outcome of the game. The bookies use 3 classes (Home Win, Draw, Away Win). They get it right about 53% of the time. This is also what I've achieved so far using my own SVM. Though it may sound high for such a random sport game, you've got to know that the home team wins about 46% of the time. So the base case (constantly predicting Home Win) has indeed 46% precision. **Probabilities vs Odds** When running a multi-class classifier like SVM you could also output a probability estimate and compare it to the betting odds. Have a look at your variance vs odds and see for what games you had very different predictions. **Explore and visualize features** With access to players and teams attributes, team formations and in-game events you should be able to produce some interesting insights into [The Beautiful Game][6] . Who knows, Guardiola himself may hire one of you some day! [1]: http://football-data.mx-api.enetscores.com/ [2]: http://www.football-data.co.uk/ [3]: http://www.football-data.co.uk/notes.txt [4]: http://sofifa.com/ [5]: https://github.com/hugomathien/football-data-collection/tree/master/footballData [6]: https://en.wikipedia.org/wiki/The_Beautiful_Game
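For the "Probabilities vs Odds" idea, a common first step is to turn bookmaker odds into implied probabilities before comparing them with a classifier's output. The sketch below assumes the odds are quoted in decimal format and removes the bookmaker margin by simple proportional normalisation; both the function names and that normalisation choice are assumptions for illustration, not part of the dataset.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to probabilities that sum to 1.

    The raw inverses 1/odds usually add up to slightly more than 1 (the
    bookmaker margin), so they are rescaled proportionally.
    """
    raw = [1.0 / odds for odds in (home_odds, draw_odds, away_odds)]
    total = sum(raw)
    return [p / total for p in raw]


def largest_disagreement(model_probs, odds):
    """Largest absolute gap between model and bookmaker probabilities."""
    bookie_probs = implied_probabilities(*odds)
    return max(abs(m - b) for m, b in zip(model_probs, bookie_probs))
```

Sorting matches by `largest_disagreement` is one simple way to find the games where a model and the bookmakers differ most.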
Overwatch is a team-based multiplayer first-person shooter video game developed and published by Blizzard Entertainment. Overwatch puts players into two teams of six, with each player selecting one of several pre-defined hero characters with unique movement, attributes, and abilities; these heroes are divided into four classes: Offense, Defense, Tank and Support. Players on a team work together to secure and defend control points on a map and/or escort a payload across the map in a limited amount of time. Players gain cosmetic rewards that do not affect gameplay, such as character skins and victory poses, as they continue to play in matches. The game was launched with casual play, while Blizzard added competitive ranked play about a month after launch. Additionally, Blizzard has developed and added new characters, maps, and game modes post-release, while stating that all Overwatch updates will remain free, with the only additional cost to players being microtransactions to earn additional cosmetic rewards. ( Wikipedia - https://en.wikipedia.org/wiki/Overwatch_(video_game) )
Open clinical trial data provide a valuable opportunity for researchers worldwide to assess new hypotheses, validate published results, and collaborate for scientific advances in medical research. Here, we present a health dataset for the non-invasive detection of cardiovascular disease (CVD), containing 657 data records from 219 subjects. The dataset covers an age range of 20–89 years and records of diseases including hypertension and diabetes. Data acquisition was carried out under the control of standard experimental conditions and specifications. This dataset can be used to carry out the study of photoplethysmograph (PPG) signal quality evaluation and to explore the intrinsic relationship between the PPG waveform and cardiovascular disease to discover and evaluate latent characteristic information contained in PPG signals. These data can also be used to study early and noninvasive screening of common CVD such as hypertension and other related CVD diseases such as diabetes.
The sea level rise expected by the year 2100 will force an adaptation of the whole coastal system and a landward retreat of the shoreline. Future scenarios, coupled with improving mining technologies, will favour increased exploitation of sand deposits for nourishment, especially for urban beaches and sandy coasts backed by lowlands. The objective of this work is to provide useful tools to support planning actions in the management of sand deposits located on the continental shelf of western Sardinia (western Mediterranean Sea). The work was realized through the integration of data and information collected during several projects. Available data consist of morpho-bathymetric data (multibeam) associated with morphoacoustic (backscatter) data, collected in the depth range -25 to -700 m. Extensive coverage of high-resolution seismic profiles (Chirp 3.5 kHz) has been acquired along the continental shelf. Surface sediment samples (Van Veen grab and box corer) and vibrocores have also been collected. These data allow mapping of the submerged sand deposits, with the determination of their thickness, volumes and sedimentological characteristics. Furthermore, it is possible to map the seabed geomorphological features of the continental shelf of western Sardinia. All the available data (doi:10.1594/PANGAEA.895430) have been integrated and organized in a geodatabase implemented through a GIS and the software suite Geoinformation Enabling Toolkit StarterKit ® (GET-IT), developed by researchers of the Italian National Research Council for the RITMARE project. 
GET-IT facilitates the creation of distributed nodes of an interoperable Spatial Data Infrastructure (SDI) and enables unskilled researchers from various scientific domains to create their own Open Geospatial Consortium (OGC) standard services for distributing geospatial data, observations and metadata of sensors and datasets. Data distribution through standard services follows the guidelines of the European Directive INSPIRE (DIRECTIVE 2007/2/EC); in particular, standard metadata describe each map level and contain identifiers such as data type, origin, property, quality and processing steps, to foster data searching and quality assessment.
Estimates of selection on juvenile size traits, compiled by Njal Rollinson and Locke Rowe in September 2013. These data are described as the "J-S Database" in the main text. This dataset includes both the data used in our formal selection analyses and the data omitted from them.
Context Hey everyone out there! Wikipedia is a publicly available encyclopedia which can be modified by anyone. Some of these modifications are useful whereas some are not. This data set captures all the edits made to English Wikipedia by anyone across the globe. As there are two edits per second, the data I have collected covers just 20 minutes. Content I have revised the original data set, removed the duplicates and kept only the relevant and useful columns. The data set has the following columns: a) action : only the edit action is captured; other actions may be Talk, etc. b) change_size : the number of characters added or deleted. A positive size means content was added and a negative size means content was deleted. c) geo_ip : null if the user is registered on Wikipedia; otherwise a JSON object containing city, latitude, country_name, region_name and longitude. d) is_anonymous : a boolean flag (true/false) indicating whether the user is unregistered (anonymous) or registered. e) is_bot : a boolean flag (true/false) indicating whether the user is a bot (robot) or a human. f) is_minor : a boolean flag (true/false) indicating whether the change made to the Wikipedia article was minor or major. g) page_title : the title of the Wikipedia article edited by the user. h) url : the URL of the page that compares the Wikipedia article before and after the change. i) user : if the user is unregistered, this field holds the IP address in either IPv4 or IPv6 format; if the user is registered, it holds the username used when registering on Wikipedia. Acknowledgements I would like to thank hatnote.com, from which I got this data. If you need the original data you may visit www.hatnote.com or connect directly to this WebSocket - ws://wikimon.hatnote.com/en/
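The column semantics described above can be sketched in Python. This is only an illustration, assuming each record has been loaded as a dict keyed by the listed column names and that `geo_ip` is stored as a JSON string; the helper name `summarize_edit` and every value in the sample record are hypothetical.

```python
import json

def summarize_edit(row):
    """Summarize one edit record (a dict keyed by the listed column names)."""
    # geo_ip is null for registered users; otherwise it is a JSON object
    # with city, latitude, country_name, region_name and longitude.
    geo = json.loads(row["geo_ip"]) if row.get("geo_ip") else None
    size = row["change_size"]
    if size > 0:
        kind = "addition"
    elif size < 0:
        kind = "deletion"
    else:
        kind = "no net change"
    return {
        "page_title": row["page_title"],
        "kind": kind,
        "country": geo["country_name"] if geo else None,
        "is_anonymous": row["is_anonymous"],
    }

# Hypothetical record for illustration only; not taken from the dataset.
sample = {
    "action": "edit",
    "change_size": -42,
    "geo_ip": '{"city": "Pune", "latitude": 18.52, "country_name": "India", '
              '"region_name": "Maharashtra", "longitude": 73.86}',
    "is_anonymous": True,
    "is_bot": False,
    "is_minor": False,
    "page_title": "Example",
    "url": "https://en.wikipedia.org/w/index.php?diff=0&oldid=0",
    "user": "203.0.113.7",
}
summary = summarize_edit(sample)
```

If the file stores `geo_ip` in another form (for example an already parsed object), the `json.loads` call would need to be adapted.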
Context Bitcoin is the longest running and most well known cryptocurrency, first released as open source in 2009 by the anonymous Satoshi Nakamoto. Bitcoin serves as a decentralized medium of digital exchange, with transactions verified and recorded in a public distributed ledger (the blockchain) without the need for a trusted record keeping authority or central intermediary. Transaction blocks contain a SHA-256 cryptographic hash of previous transaction blocks, and are thus "chained" together, serving as an immutable record of all transactions that have ever occurred. As with any currency/commodity on the market, bitcoin trading and financial instruments soon followed public adoption of bitcoin and continue to grow. Included here is historical bitcoin market data at 1-min intervals for select bitcoin exchanges where trading takes place. Happy (data) mining! Content coincheckJPY_1-min_data_2014-10-31_to_2018-06-27.csv, bitflyerJPY_1-min_data_2017-07-04_to_2018-06-27.csv, coinbaseUSD_1-min_data_2014-12-01_to_2018-06-27.csv, bitstampUSD_1-min_data_2012-01-01_to_2018-06-27.csv. CSV files for select bitcoin exchanges for the time period of Jan 2012 to July 2018, with minute to minute updates of OHLC (Open, High, Low, Close), Volume in BTC and indicated currency, and weighted bitcoin price. Timestamps are in Unix time. Timestamps without any trades or activity have their data fields forward filled from the last valid time period. If a timestamp is missing, or if there are jumps, this may be because the exchange (or its API) was down, the exchange (or its API) did not exist, or some other unforeseen technical error occurred in data reporting or gathering. All effort has been made to deduplicate entries and verify the contents are correct and complete to the best of my ability, but obviously trust at your own risk. Acknowledgements and Inspiration Bitcoin charts for the data.
The various exchange APIs, for making it difficult or unintuitive enough to get OHLC and volume data at 1-min intervals that I set out on this data scraping project. Satoshi Nakamoto and the novel core concept of the blockchain, as well as its first execution via the bitcoin protocol. I'd also like to thank viewers like you! Can't wait to see what code or insights you all have to share. I am a lowly Ph.D. student who did this for fun in my meager spare time. If you find this data interesting and you can spare a coffee to fuel my science, send it my way and I'd be immensely grateful! 1kmWmcQa8qN9ZrdGfdkw8EHKBgugKBRcF
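The Content notes say that timestamps are in Unix time and that missing timestamps or jumps can occur. A minimal sketch of how one might convert the timestamps and locate such jumps follows; it assumes the timestamp column has already been read into a list of integers (seconds), and the function names and sample values are hypothetical.

```python
from datetime import datetime, timezone

def to_utc(ts):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

def find_gaps(unix_timestamps, step=60):
    """Return (previous, next) pairs whose spacing exceeds `step` seconds."""
    ordered = sorted(unix_timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > step]

# Hypothetical timestamps: three consecutive minutes, then a five-minute jump.
stamps = [1325317920, 1325317980, 1325318040, 1325318340]
gaps = find_gaps(stamps)
readable = [(to_utc(a).isoformat(), to_utc(b).isoformat()) for a, b in gaps]
```

Because rows without trades are forward filled, a gap found this way points to a missing row rather than to a quiet minute.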
Please refer [here](https://github.com/jp2011/london-crime-data-retriever) for further information.
Context Anonymized data from profiles scraped on LinkedIn. Contains data from about 15000 profiles. Profiles came from people predominantly located in Australia. Includes all their work history as well as analysis of their photo and name. Content Each row contains: * Profile data * Job data * Name analysis (Race, Gender) * Profile picture analysis (Age, Race, Gender, Attractiveness, Health, Emotionality)
Description Data from [Our World In Data][1] with the mortality rates for causes of death by country and region between 1990 and 201 Acknowledgements Thanks to Our World In Data for collecting this information. [1]: https://ourworldindata.org/
Dataset for prediction of material elastic tensors. Note on citations: If you found this dataset useful and would like to cite it in your work, please be sure to cite its original sources below rather than or in addition to this page. Dataset described in: de Jong M, Chen W, Angsten T, Jain A, Notestine R, Gamst A, Sluiter M, Ande CK, van der Zwaag S, Plata JJ, Toher C, Curtarolo S, Ceder G, Persson KA, Asta M (2015) Charting the complete elastic properties of inorganic crystalline compounds. Scientific Data 2: 150009. https://doi.org/10.1038/sdata.2015.9 Data converted from json file available on Dryad (see references 3-4): de Jong M, Chen W, Angsten T, Jain A, Notestine R, Gamst A, Sluiter M, Ande CK, van der Zwaag S, Plata JJ, Toher C, Curtarolo S, Ceder G, Persson KA, Asta M (2015) Data from: Charting the complete elastic properties of inorganic crystalline compounds. Dryad Digital Repository. https://doi.org/10.5061/dryad.h505v
This dataset contains key characteristics about the data described in the Data Descriptor Temporary dense seismic network during the 2016 Central Italy seismic emergency for microzonation studies. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format
A multimodal dataset (visual, auditory, electric median nerve) recorded with a Neuromag Vectorview 306-channel MEG system.
International Financial Statistics (IFS) is a standard source of international statistics on all aspects of international and domestic finance. It reports, for most countries of the world, current data needed in the analysis of problems of international payments and of inflation and deflation, i.e., data on exchange rates, international liquidity, international banking, money and banking, interest rates, prices, production, international transactions, government accounts, and national accounts. Last update in UNdata: 14 May 2010. If you need more current data, the IMF has made their current database available for [bulk download for personal use](http://data.imf.org/?sk=388DFA60-1D26-4ADE-B505-A05A558D9A42). Acknowledgements This dataset was kindly published by the United Nations on the UNData site. You can find [the original dataset here](http://data.un.org/Explorer.aspx). License [Per the UNData terms of use](http://data.un.org/Host.aspx?Content=UNdataUse): all data and metadata provided on UNdata’s website are available free of charge and may be copied freely, duplicated and further distributed provided that [UNdata](http://data.un.org/Explorer.aspx) is cited as the reference.
Context Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others." Zalando seeks to replace the original MNIST dataset. Content Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. The training and test data sets have 785 columns. The first column consists of the class labels (listed under Labels) and represents the article of clothing. The rest of the columns contain the pixel-values of the associated image. - To locate a pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27. The pixel is located on row i and column j of a 28 x 28 matrix. - For example, pixel31 indicates the pixel that is in the fourth column from the left, and the second row from the top.
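The decomposition x = i * 28 + j can be checked with a few lines of Python. This is a small sketch; the function name `pixel_position` is hypothetical and not part of the dataset or its tooling.

```python
def pixel_position(x, width=28):
    """Map a flat pixel index x to (row i, column j), where x = i * width + j."""
    if not 0 <= x < width * width:
        raise ValueError("pixel index out of range")
    # divmod returns (x // width, x % width), i.e. (row, column), both zero-based.
    return divmod(x, width)

row, col = pixel_position(31)  # 31 = 1 * 28 + 3
```

With zero-based indices, row 1 is the second row from the top and column 3 is the fourth column from the left, which matches the pixel31 example.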
**Labels** Each training and test example is assigned to one of the following labels: - 0 T-shirt/top - 1 Trouser - 2 Pullover - 3 Dress - 4 Coat - 5 Sandal - 6 Shirt - 7 Sneaker - 8 Bag - 9 Ankle boot TL;DR - Each row is a separate image - Column 1 is the class label. - Remaining columns are pixel numbers (784 total). - Each value is the darkness of the pixel (0 to 255) Acknowledgements - Original dataset was downloaded from [https://github.com/zalandoresearch/fashion-mnist][1] - Dataset was converted to CSV with this script: [https://pjreddie.com/projects/mnist-in-csv/][2] License The MIT License (MIT) Copyright © [2017] Zalando SE, https://tech.zalando.com Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. [1]: https://github.com/zalandoresearch/fashion-mnist [2]: https://pjreddie.com/projects/mnist-in-csv/
Summary of the CIBERSORT algorithm analysis for all datasets and PCD samples. Cell types are in columns and samples are in rows. Values are proportions on a 0 to 1 scale.
The Global Energy Balance Archive (GEBA) is a database for the central storage of the worldwide measured energy fluxes at the Earth's surface, maintained at ETH Zurich (Switzerland). This paper documents the status of the GEBA version 2017 dataset, presents the new web interface and user access, and reviews the scientific impact that GEBA data have had in various applications. GEBA has continuously been expanded and updated and contains in its 2017 version around 500,000 monthly mean entries of various surface energy balance components measured at 2500 locations. The database contains observations from 15 surface energy flux components, with the most widely measured quantity available in GEBA being the shortwave radiation incident at the Earth's surface (global radiation). Many of the historic records extend over several decades. GEBA contains monthly data from a variety of sources, namely from the World Radiation Data Centre (WRDC) in St. Petersburg, from national weather services, from different research networks (BSRN, ARM, SURFRAD), from peer-reviewed publications, project and data reports, and from personal communications. Quality checks are applied to test for gross errors in the dataset. GEBA has played a key role in various research applications, such as in the quantification of the global energy balance, in the discussion of the anomalous atmospheric shortwave absorption, and in the detection of multi-decadal variations in global radiation, known as "global dimming" and "brightening". GEBA is further extensively used for the evaluation of climate models and satellite-derived surface flux products. On a more applied level, GEBA provides the basis for engineering applications in the context of solar power generation, water management, agricultural production and tourism. GEBA is publicly accessible through the internet via http://www.geba.ethz.ch.
Mean body sizes, approximated from the natural logarithm of the lower first or second molar area, and the proposed evolutionary relationships between mammalian genera from the late Clarkforkian and earliest Wasatchian (Cf3 to Wa0) of the Bighorn and Clarks Fork Basins, Wyoming, USA. For details of the dataset see caption for electronic supplementary material, dataset S1.
classical iris dataset
Content Big collection of quotes with their authors, category, tags and popularity(from 0 to 1)
Job Posts dataset The dataset consists of 19,000 job postings that were posted through the Armenian human resource portal CareerCenter. The data was extracted from the Yahoo! mailing group https://groups.yahoo.com/neo/groups/careercenter-am. This was the only online human resource portal in the early 2000s. A job posting usually has some structure, although some fields of the posting are not necessarily filled out by the client (poster). The data was cleaned by removing posts that were not job related or had no structure. The data consists of job posts from 2004-2015. Content jobpost - The original job post; date - Date it was posted in the group; Title - Job title; Company - Employer; AnnouncementCode - Announcement code (some internal code, is usually missing); Term - Full-Time, Part-time, etc.; Eligibility - Eligibility of the candidates; Audience - Who can apply?; StartDate - Start date of work; Duration - Duration of the employment; Location - Employment location; JobDescription - Job description; JobRequirment - Job requirements; RequiredQual - Required qualification; Salary - Salary; ApplicationP - Application procedure; OpeningDate - Opening date of the job announcement; Deadline - Deadline for the job announcement; Notes - Additional notes; AboutC - About the company; Attach - Attachments; Year - Year of the announcement (derived from the field date); Month - Month of the announcement (derived from the field date); IT - TRUE if the job is an IT job. This variable is created by a simple search of IT job titles within column “Title”. Acknowledgements The data collection and initial research was funded by the American University of Armenia’s research grant (2015). Inspiration The online job market is a good indicator of overall demand for labor in the local economy. In addition, online job postings data are easier and quicker to collect, and they can be a richer source of information than more traditional job postings, such as those found in printed newspapers.
The data can be used in the following ways: -Understand the demand for certain professions, job titles, or industries -Help universities with curriculum development -Identify skills that are most frequently required by employers, and how the distribution of necessary skills changes over time -Make recommendations to job seekers and employers Past research We have used association rules mining and simple text mining techniques to analyze the data. Some results can be found here (https://www.slideshare.net/HabetMadoyan/it-skills-analysis-63686238).
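The derived fields in the column list (Year and Month taken from date, and IT set by a simple search of job titles) could be reproduced roughly as follows. This is a sketch under stated assumptions: the date format, the keyword list `IT_KEYWORDS` and the function name `derive_fields` are all hypothetical, since the description does not give the exact search terms or the stored date format.

```python
from datetime import datetime

# Hypothetical keyword list; the terms actually searched for are not given.
IT_KEYWORDS = ("developer", "programmer", "software", "database", "network administrator")

def derive_fields(date_text, title, date_format="%Y-%m-%d"):
    """Derive Year, Month and the IT flag from a posting's date and title."""
    posted = datetime.strptime(date_text, date_format)  # assumed ISO-style date
    lowered = title.lower()
    is_it = any(keyword in lowered for keyword in IT_KEYWORDS)
    return {"Year": posted.year, "Month": posted.month, "IT": is_it}

example = derive_fields("2004-01-05", "Senior Software Developer")
```

A plain substring search like this is easy to audit, but it will miss IT roles with unusual titles and may flag unrelated ones, so the keyword list deserves a manual check against the Title column.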
Missing Persons India ============== Taken from the pdf available at [National Crime Records Bureau](http://ncrb.nic.in/MissingUidb/20170821-Missing%20Person%20Report.pdf). Since the original was a PDF this is a table extracted from the original PDF using scripts. Some level of noise is present in the data partly due to the original source and partly due to the extraction scripts.
Content Nigerian dishes of the Yoruba ethnic group. It contains 6 food classes (amala, eba, efo, ewedu, fufu and iyan).
This is a CSV file containing the data behind the Altmetric Top 100 for 2017. A second dataset pulls out the authors and institutions: https://figshare.com/articles/altmetric_top_100_authors_2017_csv/5683963
Datasets used in this study
This work presents the Guadalfeo Monitoring Network in Sierra Nevada (Spain), a snow monitoring network in the Guadalfeo Experimental Catchment, a semiarid area in southern Europe representative of snow packs with highly variable dynamics on both the annual and seasonal scales, and significant topographic gradients. The network includes weather stations that cover the high mountain area in the catchment and time-lapse cameras to capture the variability of the ablation phases on different spatial scales. This dataset consists of snow cover maps from time-lapse camera C1 of the Guadalfeo Monitoring Network, at 10 m x 10 m spatial resolution, for those days when Landsat satellites overpass the area (Pimentel et al., 2017).
Many women who are initially thought to have angina turn out to have normal coronary angiograms, that is, they are found not to have angina after all. A study was carried out to assess the feasibility of a preliminary screening test. For a large number of patients who were thought to have angina, information on a number of possible risk factors was collected and then their subsequent angina status was recorded. The data is available as an R data frame entitled angina and contains the following information: status: whether woman turns out to have angina (yes/no) age: age of a woman smoke: smoking status (1=current-, 2=ex-, 3=non-smoker) cig: current average number of cigarettes per day hyper: hypertension (1=absent, 2=mild, 3=moderate) angfam: family history of angina (yes/no) myofam: family history of myocardial infarction (yes/no) strokefam: family history of stroke (yes/no) diabetes: does woman have diabetes? (yes/no) Missing values are coded as NA. The main aim of this study was to find out which, if any, of the health variables are associated with angina and whether some subset of them could be used to help predict the dependent variable, angina status. The accompanying document on "Model selection through backward elimination" is going to be useful for that purpose. More specifically, it would be helpful to be able to estimate the risk/probability that a woman with a particular combination of these health variables truly has angina. If such a scheme of estimating risks can be constructed, is it likely to be useful? That is, is it good at predicting whether a woman has angina or not (since the treatment of angina is expensive)? In addition, it would be of interest to estimate the individual effects of important variables. For example, if smoking seems to be a risk factor, then what are the odds of a smoker having angina relative to a non-smoker? What about ex-smokers and light smokers?
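The closing question about the odds of a smoker having angina relative to a non-smoker is an odds ratio. As a rough illustration, the unadjusted version can be computed from a 2x2 table of counts; the counts below are hypothetical and are not taken from the angina data frame, and the function name is an assumption.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds of the outcome among the exposed divided by the odds among the unexposed."""
    exposed_odds = exposed_cases / exposed_noncases
    unexposed_odds = unexposed_cases / unexposed_noncases
    return exposed_odds / unexposed_odds

# Hypothetical counts for illustration only.
smokers_yes, smokers_no = 30, 20        # current smokers with / without angina
nonsmokers_yes, nonsmokers_no = 25, 50  # non-smokers with / without angina
ratio = odds_ratio(smokers_yes, smokers_no, nonsmokers_yes, nonsmokers_no)
```

This ratio ignores the other variables. In a logistic regression fitted to the same data, the exponential of the coefficient for a smoking category plays the corresponding role after adjusting for the remaining predictors.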
Raw data for the Floating Island Lake pollen dataset obtained from the Neotoma Paleoecological Database.
FROM-GLC-Hierarchy (Yu et al., 2014) is a multi-resolution land cover dataset (i.e. 30 m, 250 m, 500 m, 1 km, 5 km, 10 km, 25 km, 50 km, 100 km) designed to meet the resolution requirements of different applications. The 30 m base map was improved from FROM-GLC-agg with additional coarse resolution datasets (i.e., MCD12Q1 (Friedl et al., 2010), GlobCover2009 (Bontemps et al., 2010) etc.) to reduce land cover type confusion. Around 1.1% of pixels were replaced by coarse resolution products. Validation-based assessments indicate that the accuracies of the land cover maps at 30 m, 250 m, 500 m and 1 km resolution are 69.50%, 76.65%, 74.65%, and 73.47%, respectively. Further analysis of area-estimation biases for different land cover types at different resolutions suggests that maps coarser than 5 km resolution contain at least 5% area estimation error for most land cover types. Proportion layers, which contain precise information on land cover percentage, are suggested for use when coarser resolution land cover data are required. Please refer to the classification system at http://data.ess.tsinghua.edu.cn/.
This is the raw data from a manuscript entitled "Flow velocity and nutrients affect CO₂ emissions from agricultural drainage channels". Controlled field mesocosms were used to mimic agricultural drainage channels and to follow the fate of the initial dissolved inorganic carbon. The dataset contains a time series of water parameters for each mesocosm, including water velocity, water depth, water temperature (WT), EC, dissolved oxygen (DO), dissolved inorganic carbon (DIC), nitrate, CO₂ flux, and CH₄ flux.
Context JeuxVideo.com is a French website that has specialized in video games since 1997. It is built as an information tool for players by a team of writers and offers news, files, video game tests and video presentations. Jeuxvideo.com is the most popular French site for video game news. The site's attendance record dates from E3 2013, on June 11, 2013, with a peak of 33 million hits on its pages. Content The dataset covers over 700 video games on JeuxVideo.com. Acknowledgements Data was scraped from jeuxvideo.com.
Context SPECint2006 Rate Results for Intel Xeon Scalable Processors Content Data collected on October 30, 2017 Acknowledgements - [The Standard Performance Evaluation Corporation (SPEC)][1] - [Intel ARK][2] - Photo by Samuel Zeller on Unsplash Inspiration Intel introduced new processor names: Platinum, Gold, Silver and Bronze. It would be nice to visualise the differences between them. [1]: https://www.spec.org/ [2]: https://ark.intel.com/
Context This dataset was given to me to solve a task during my master's program. The idea was to separate the fish into three batches that are as unrelated as possible, using genetic algorithms. Content A matrix in which every row and every column represents a fish, so each cell gives the relation between the fish of that column and the fish of that row.
Content More details about each file are in the individual file descriptions. Context This is a dataset hosted by the City of Seattle. The city has an open data platform found [here](https://data.seattle.gov/) and they update their information according to the amount of data that is brought in. Explore the City of Seattle using Kaggle and all of the data sources available through the City of Seattle [organization page](https://www.kaggle.com/city-of-seattle)! * Update Frequency: This dataset is updated monthly. Acknowledgements This dataset is maintained using Socrata's API and Kaggle's API. [Socrata](https://socrata.com/) has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public. This dataset is distributed under the following licenses: Open Data Commons Public Domain Dedication and License
This dataset (Megapool) contains background subtracted frequencies of CD4 and CD8 T cells producing a combination of IFNγ, IL-2 and TNF in response to stimulation by Megapool for Mtb-infected but healthy individuals (Lindestam Arlehamn, 2016). See referenced paper for details. See Column names - Megapool.docx for meanings of column names in Megapool.xlsx.
This is the raw dataset. This file contains the sampling data (date and location), the aphids analysed, the parasitism rate and the parasitoid detected with the molecular tool. This file was used to perform the statistical tests (with R) and to build figure 1.
Loans data
Dahurian larch (Larix gmelinii Rupr.) is the dominant species in northeast China, which is situated in the southernmost part of the global boreal forest biome and is undergoing the greatest climatically induced changes. Published studies (1965-2015) on tree aboveground growth of Larix gmelinii forests in northeast China were collected and critically reviewed in this study, and a comprehensive growth dataset was developed from 122 sites, distributed between 40.85° N and 53.47° N in latitude, between 118.20° E and 133.70° E in longitude, and between 130 m and 1260 m in altitude. The dataset is composed of 743 entries, including growth data (mean tree height, mean DBH, mean tree volume and/or stand volume) and the associated information, i.e., geographical location (latitude, longitude, altitude, aspect and slope), climate (mean annual temperature (MAT) and mean annual precipitation (MAP)), stand description (origin, stand age, stand density and canopy density), and sample regime (observing year, plot area and number). It provides quantitative references for plantation management practices and for boreal forest growth prediction under future climate change.
This dataset shows maps of the sediment properties and physical environment of the seabed on the northwest European Continental Shelf. Mapped products are: mud, sand and gravel percentages; rock cover; whole-sediment, sand- and gravel-fraction median grain sizes; porosity and permeability; carbon and nitrogen content of sediments; mean and maximum depth-averaged tidal velocity and wave orbital-velocity; monthly natural disturbance rates. Data products are produced at a spatial resolution of 0.125 by 0.125 degrees. [Please note that a previous pre-peer review version of this dataset exists: http://dx.doi.org/10.15129/07bc686e-a354-40de-8c08-372ced7aad64]
Copyright information: Taken from "Unequal evolutionary conservation of human protein interactions in interologous networks" http://genomebiology.com/2007/8/5/R95 Genome Biology 2007;8(5):R95-R95. Published online 29 May 2007 PMCID:PMC1929159. Co-expression of yeast 'high confidence' protein interactions (solid lines) and random protein pairs (dotted lines) using two microarray datasets. This network is enriched in stable complexes, represented by a high mean correlation. Co-expression of the yeast 'kinome' [31], which is enriched for transient interactions. This type of interaction shows co-expression that is highly similar to the random distribution (dotted lines). Distribution of clustering coefficients in stable and transient PPI networks. Complexes are represented by a high C (blue line), while the sparsely connected transient network is typified by a low C (green line). The properties of the human interaction network. The clustering coefficients indicate that this network is more sparsely connected, with few protein complexes. The co-expression profile is only slightly higher than the randomly generated distribution, suggesting the presence of many transient PPIs.
Context So I was trying to use a VGG19 pretrained model with Keras but the Docker instance couldn't download the model file. There's an open ticket for this issue here: https://github.com/Kaggle/docker-python/issues/73 Content Just starting off with VGG16 and VGG19 for now. If this works, I'll upload some more. The weights for the full vgg16 and vgg19 files were too large to upload as a single file. I tried uploading them in parts but there wasn't enough room to extract them in the working directory. Here's an example of how to use the model files: > from keras import applications > > keras_models_dir = "../input/keras-models" > > model = applications.VGG16(include_top=False, weights=None) > model.load_weights('%s/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5' > % keras_models_dir) Here are some more examples of how to use it: https://www.kaggle.com/ekkus93/keras-models-as-datasets-test Acknowledgements I downloaded the files from here: https://github.com/fchollet/deep-learning-models Inspiration I just wanted to try out something with the Dogs vs Cats dataset and VGG19.
Resting-state fMRI (rsfMRI) data generate time courses with unpredictable hills and valleys. People with musical training may notice that, to some degree, they resemble the notes of a musical scale. Taking advantage of these similarities, and using only rsfMRI data as input, we use basic rules of music theory to transform the data into musical form. Our project is implemented in Python using the midiutil library. We used open rsfMRI data from the ABIDE dataset preprocessed by the Preprocessed Connectomes Project. We randomly chose 10 individual datasets preprocessed using the C-PAC pipeline with 4 different strategies. To reduce the data dimensionality, we used the CC200 atlas to downsample voxels to 200 regions-of-interest. A framework for generating music from fMRI data, based on music theory, was developed and implemented as a Python tool yielding several audio files. When listening to the results, we noticed that the music differed across individual datasets. However, music generated from the same individual (4 preprocessing strategies) remained similar. Our results sound different from music obtained in a similar study using EEG and fMRI data.
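The project's own mapping rules are not spelled out in this description, so the following is not the authors' method. It is a generic sketch of one way a time course could be quantized onto the notes of a major scale, with all names (`MAJOR_STEPS`, `series_to_pitches`) and the sample values chosen for illustration.

```python
MAJOR_STEPS = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets of a major scale

def series_to_pitches(values, base_pitch=60, octaves=2):
    """Quantize a numeric time course onto MIDI pitches of a major scale.

    base_pitch=60 is middle C, so the default scale is C major over two octaves.
    """
    scale = [base_pitch + 12 * o + s for o in range(octaves) for s in MAJOR_STEPS]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # a flat series maps every sample to the lowest note
    return [scale[round((v - lo) / span * (len(scale) - 1))] for v in values]

# Hypothetical region-of-interest signal, not taken from ABIDE.
pitches = series_to_pitches([0.0, 0.3, 1.0, 0.8, 0.1])
```

The resulting integers are MIDI pitch numbers, so they could be written to a file with a MIDI library such as midiutil, which the description names as the project's dependency.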
About the Dataset This dataset contains the complete data cube collected over Pavia University, Italy. It is an older dataset on which hyperspectral analysis methods can be applied.
Datasets of four experiments testing whether subitizing in the periphery can be crowded by nearby flankers.
Please note: Please start using ds633.0 to access RDA maintained ERA-5 data, see ERA5 Reanalysis (0.25 Degree Latitude-Longitude Grid) [https://rda.ucar.edu/datasets/ds633.0], RDA dataset ds633.0. This dataset is no longer being updated, and web access will be removed on October 1, 2019. After many years of research and technical preparation, the production of a new ECMWF climate reanalysis to replace ERA-Interim is in progress. ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, which started with the FGGE reanalyses produced in the 1980s, followed by ERA-15, ERA-40 and most recently ERA-Interim. ERA5 will cover the period January 1950 to near real time, though the first segment of data to be released will span the period 2010-2016. ERA5 is produced using high-resolution forecasts (HRES) at 31 kilometer resolution (one fourth the spatial resolution of the operational model) and a 62 kilometer resolution ten member 4D-Var ensemble of data assimilation (EDA) in CY41r2 of ECMWF's Integrated Forecast System (IFS) with 137 hybrid sigma-pressure (model) levels in the vertical, up to a top level of 0.01 hPa. Atmospheric data on these levels are interpolated to 37 pressure levels (the same levels as in ERA-Interim). Surface or single level data are also available, containing 2D parameters such as precipitation, 2 meter temperature, top of atmosphere radiation and vertical integrals over the entire atmosphere. The IFS is coupled to a soil model, the parameters of which are also designated as surface parameters, and an ocean wave model. Generally, the data is available at an hourly frequency and consists of analyses and short (18 hour) forecasts, initialized twice daily from analyses at 06 and 18 UTC. Most analyses parameters are also available from the forecasts. There are a number of forecast parameters, e.g. mean rates and accumulations, that are not available from the analyses.
Improvements to ERA5, compared to ERA-Interim, include use of HadISST.2, reprocessed ECMWF climate data records (CDR), and implementation of RTTOV11 radiative transfer. Variational bias corrections have not only been applied to satellite radiances, but also ozone retrievals, aircraft observations, surface pressure, and radiosonde profiles. NCAR's Data Support Section (DSS) is performing and supplying a grid transformed version of ERA5, in which variables originally represented as spectral coefficients or archived on a reduced Gaussian grid are transformed to a regular 1280 longitude by 640 latitude N320 Gaussian grid. In addition, DSS is also computing horizontal winds (u-component, v-component) from spectral vorticity and divergence where these are available. Finally, the data is reprocessed into single parameter time series. Please note: As of November 2017, DSS is also producing a CF 1.6 compliant netCDF-4/HDF5 version of ERA5 for CISL RDA at NCAR. The netCDF-4/HDF5 version is the de facto RDA ERA5 online data format. The GRIB1 data format is only available via NCAR's High Performance Storage System (HPSS). We encourage users to evaluate the netCDF-4/HDF5 version for their work, and to use the currently existing GRIB1 files as a reference and basis of comparison. To ease this transition, there is a one-to-one correspondence between the netCDF-4/HDF5 and GRIB1 files, with as much GRIB1 metadata as possible incorporated into the attributes of the netCDF-4/HDF5 counterpart.
Dataset file
Context -------- Game of Thrones is a hit fantasy TV show based on the equally famous book series "A Song of Ice and Fire" by George R. R. Martin. The show is well known for its vastly complicated political landscape, large number of characters, and its frequent character deaths. Content ------------ Of course, it goes without saying that this dataset contains spoilers ;) This dataset combines three sources of data, all of which are based on information from the book series. - Firstly, there is **battles.csv** which contains Chris Albon's "The War of the Five Kings" Dataset. It's a great collection of all of the battles in the series. - Secondly we have **character-deaths.csv** from Erin Pierce and Ben Kahle. This dataset was created as a part of their Bayesian Survival Analysis. - Finally we have a more comprehensive character dataset with **character-predictions.csv**. It includes their predictions on which character will die. Acknowledgements ------------ - Firstly, there is **battles.csv** which contains Chris Albon's "The War of the Five Kings" Dataset, which can be found here: https://github.com/chrisalbon/war_of_the_five_kings_dataset . It's a great collection of all of the battles in the series. - Secondly we have **character-deaths.csv** from Erin Pierce and Ben Kahle. This dataset was created as a part of their Bayesian Survival Analysis which can be found here: http://allendowney.blogspot.com/2015/03/bayesian-survival-analysis-for-game-of.html - Finally we have a more comprehensive character dataset with **character-predictions.csv**. This comes from the team at A Song of Ice and Data who scraped it from http://awoiaf.westeros.org/ . It also includes their predictions on which character will die, the methodology of which can be found here: https://got.show/machine-learning-algorithm-predicts-death-game-of-thrones Inspiration ------------ What insights about the complicated political landscape of this fantasy world can you find in this data?
Iris data set head
SNP datasets for each species and summary table of SNP information. Note that Lithobates sphenocephalus (Lsp) = Rana sphenocephala.
This dataset includes the data for our third analysis, which examined fecal glucocorticoids (fGC), fecal estrogens (fE), and fecal progestogens (fP) during PPA as a function of the number of months to sexual cycle resumption, or during the various phases of the sexual cycles (early or late follicular and luteal) as a function of the number of cycles to conception. For fGC and fE we had a total of 5,470 hormone samples collected between 2000 and 2014 for 138 females during 549 IBIs. Of these samples, 4,150 fecal samples were collected during PPA, and 1,320 were collected during cycling. Each female contributed fecal samples from an average of 4 IBIs (range: 1-11) with an average of 10 hormone samples per IBI (range: 1-55) and an average total number of 40 samples per female (range: 1-145). For fP, we had a smaller sample size of 4,095 hormone samples collected between 2003 and 2014 for 130 females during 456 IBIs; 2,922 were collected during PPA and 1,173 during cycling. Each female contributed samples from an average of 3.5 IBIs (range 1-8), with an average of 9 (range 1-47) samples per IBI, for an average total number of 32 (range 1-141) samples per female.
dataset for test P-HDBF
Modern dataset used for the R code "Indo-Pac_figs.R".
This dataset contains key characteristics about the data described in the Data Descriptor De novo transcriptomes of 14 gammarid individuals for proteogenomic analysis of seven taxonomic groups. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format
Knowledge about the coverage and characteristics of glaciers in High Mountain Asia is still incomplete and heterogeneous. However, several applications such as modelling of past or future glacier development, runoff or glacier volume, rely on the existence and accessibility of complete datasets. In particular, precise outlines of glacier extent are required to spatially constrain glacier-specific calculations such as length, area and volume changes or flow velocities. As a contribution to the Randolph Glacier Inventory (RGI) and the Global Land Ice Measurements from Space (GLIMS) glacier database, we have produced a homogeneous inventory of the Pamir and the Karakoram mountain ranges using 28 Landsat TM and ETM+ scenes acquired around the year 2000. We applied a standardized method of automated digital glacier mapping and manual correction using coherence images from ALOS-1 PALSAR-1 as an additional source of information; we then separated the glacier complexes into individual glaciers using drainage divides derived by watershed analysis from the ASTER GDEM2, and separately delineated all debris-covered areas. Assessment of uncertainties was performed for debris-covered and clean-ice glacier parts using the buffer method and independent multiple digitizing of three glaciers representing key challenges such as shadows and debris cover. Indeed, along with seasonal snow at high elevations, shadow and debris cover represent the largest uncertainties in our final dataset. In total, we mapped more than 27'800 glaciers >0.02 km² covering an area of 35'520 ±1948 km² and an elevation range from 2260 m to 8600 m. Regional median glacier elevations vary from 4150 m (Pamir Alai) to almost 5400 m (Karakoram), which is largely due to differences in temperature and precipitation. Supraglacial debris covers an area of 3587 ±662 km², i.e. 10% of the total glacierised area. 
Larger glaciers have a higher share in debris-covered area (up to >20%), making it an important factor to be considered in subsequent applications.
The emission intensity profile is estimated by averaging the signal over several pixel rows from the recorded images to get a complementary dataset that yields information similar to optical time of flight (OTOF) measurements, except that the current measurements provide a convolution of the times of flight of all the species that move within the plume for various energies in the SP scheme. While a double peak is visible for irradiation energies greater than or equal to 200 microjoules, with larger plume length, 100 microjoules shows a single peak with low emission counts and low spatial expansion. Also, the emission count corresponding to the fast peak is always lower than that of its slow counterpart in each case. To compare all the graphs, the emission count is normalized using the maximum emission count in all the data sets (which is the DP 100 case).
Content Hotel reviews given by customers.
Near-surface air temperatures were monitored from 2005 to 2010 in a mesoscale network of 230 sites in the foothills of the Rocky Mountains in southwestern Alberta, Canada. The monitoring network covers a range of elevations from 890 to 2880 m above sea level and an area of about 18 000 km², sampling a variety of topographic settings and surface environments with an average spatial density of one station per 78 km². This paper presents the multiyear temperature dataset from this study, with minimum, maximum, and mean daily temperature data. In this paper, we describe the quality control and processing methods used to clean and filter the data and assess its accuracy. Overall data coverage for the study period is 91%. We introduce a weather-system-dependent gap-filling technique to estimate the missing 9% of data. Monthly and seasonal distributions of minimum, maximum, and mean daily temperature lapse rates are shown for the region.
json file
The results of the membrane feeding experiments performed during the different infectivity surveys are included in this dataset. In addition to data on the feeding experiments, information on the age of study participants is included.
Morphological data of the Lunterse beek dataset
The file contains standardized multilocus heterozygosity calculated at 37 putatively neutral markers as well as at 6 MHC-linked markers, together with fitness data, for 147 male Alpine ibex of Gran Paradiso National Park (Italy). Please note that fitness data were only available for a subset of the genotyped individuals. The heterozygosity-fitness correlations were therefore based only on 147 out of the 247 individuals in file "genotypes_PNGP". The full dataset was however used for the calculations of diversity measures and also for the standardized multilocus heterozygosity, which was calculated for each individual as the ratio of its heterozygosity to the mean heterozygosity in the population of the loci at which the individual was genotyped (Coltman et al., 1999).
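The ratio described above can be sketched in a few lines. This is a minimal illustration only: the function name `standardized_mlh` and the in-memory layout (allele pairs per locus, `None` for an untyped locus) are assumptions made for the example and are not taken from the dataset files.

```python
from typing import Dict, Optional, Tuple

# locus name -> pair of alleles, or None when the locus was not typed
# (hypothetical layout, chosen only for this sketch)
Genotype = Dict[str, Optional[Tuple[str, str]]]


def standardized_mlh(individual: str, genotypes: Dict[str, Genotype]) -> float:
    """Individual heterozygosity divided by the population mean
    heterozygosity of the loci at which that individual was typed."""
    own = genotypes[individual]
    typed = [locus for locus, pair in own.items() if pair is not None]
    if not typed:
        raise ValueError("individual has no typed loci")

    # Proportion of the individual's typed loci that are heterozygous.
    own_het = sum(own[locus][0] != own[locus][1] for locus in typed) / len(typed)

    # Population heterozygosity at each of those same loci.
    per_locus = []
    for locus in typed:
        pairs = [g[locus] for g in genotypes.values() if g.get(locus) is not None]
        per_locus.append(sum(a != b for a, b in pairs) / len(pairs))
    mean_het = sum(per_locus) / len(per_locus)
    if mean_het == 0:
        raise ValueError("population heterozygosity is zero at the typed loci")
    return own_het / mean_het
```

Restricting the denominator to the loci typed in that individual is what keeps values comparable between individuals with different amounts of missing data.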
Context There's a story behind every dataset and here's your opportunity to share yours. Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?
Dataset including information on children's diet and household characteristics for over 43,000 households across 27 developing countries from the Demographic and Health Surveys program 2000 - 2013 (https://www.dhsprogram.com/data/available-datasets.cfm). The dataset also includes spatial variables such as distance to forest edge, road and city.
This geodatabase was built to cover several geothermal targets developed by Flint Geothermal in 2012 during a search for high-temperature systems that could be exploited for electric power development. Several of the thermal springs have geochemistry and geothermometry values indicative of high-temperature systems. In addition, the explorationists discovered a very young Climax-style molybdenum porphyry system northeast of Rico, and drilling intersected thermal waters at depth. Datasets include: 1. Structural data collected by Flint Geothermal 2. Point information 3. Mines and prospects from the USGS MRDS dataset 4. Results of reconnaissance shallow (2 meter) temperature surveys 5. Air photo lineaments 6. Areas covered by travertine 7. Groundwater geochemistry 8. Land ownership in the Rico area 9. Georeferenced geologic map of the Rico Quadrangle, by Pratt et al. 10. Various 1:24,000 scale topographic maps
During hot summer days and heat waves, bedrooms can become warm and uncomfortable, affecting sleep quality. This dataset presents bedroom air temperatures, bedroom window positions and important bedroom characteristics (floor, orientation and roof material) to investigate factors that promote cool bedrooms. The dataset presents air temperatures measured in 20 bedrooms of terraced houses in Amsterdam, the Netherlands, during an extremely hot summer week in 2016. The dataset also presents outdoor temperatures measured over the same time period. The dataset includes time series of the position of the bedroom windows (open, half-open, closed) to investigate the effect of the window position on the bedroom temperature. The air temperature measurements were carried out with iButtons, recording air temperatures at a time interval of 10 minutes. The measurement accuracy of an iButton is 0.5 °C. In each bedroom, two iButtons were placed, often on bedside tables in the shade, and in bedrooms on the first floor. In most cases this is the top floor of the houses, except for two bedrooms which were on the mezzanine floor. Bedrooms were orientated to the southeast or northwest. The roofing material was bitumen/EPDM, gravel or green. Two iButtons measured the outdoor temperature. These were placed outside in the shade: one in the front yard and one in the back yard of one of the terraced houses. All iButtons registered air temperatures between Tuesday, August 22nd, 2016 and Sunday August 28th, 2016. Each hour, the homeowners recorded whether the bedroom window was closed, half-open or open.
This dataset includes annual urban extent dynamics (1985-2015) in the conterminous United States at a 30 m resolution. (1) The dataset is organized by state (49 in total) in the conterminous US. The location of each US state can be found in the uploaded figure “US_State.jpg”. Full names and abbreviations of states are provided in the Excel file “US_StateList.xls”. (2) The data are provided in GeoTIFF format, i.e., the georeferencing information is embedded within the TIFF file. Each dataset was projected to the Albers Equal Area Conic projection, with a spatial resolution of 30 m. (3) The legend for the GeoTIFF files can be found in the figure “Legend.jpg”, and more detailed information about the urbanized year and the pixel value can be found in the file “Year_Code_Loopup.csv”.
This dataset contains key characteristics about the data described in the Data Descriptor The sequencing and de novo assembly of the Larimichthys crocea genome using PacBio and Hi-C technologies. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format 3. machine readable metadata file in ISA-Tab format (zipped folder)
Context ------------- Most publicly available football (soccer) statistics are limited to aggregated data such as Goals, Shots, Fouls, Cards. When assessing performance or building predictive models, this simple aggregation, without any context, can be misleading. For example, a team that produced 10 shots on target from long range has a lower chance of scoring than a club that produced the same number of shots from inside the box. However, metrics derived from this simple count of shots will assess the two teams similarly. A football game generates many more events, and it is very important and interesting to take into account the context in which those events were generated. This dataset should keep sports analytics enthusiasts awake for long hours as the number of questions that can be asked is huge. Content ------- This dataset is a result of a very tiresome effort of webscraping and integrating different data sources. The central element is the text commentary. All the events were derived by reverse engineering the text commentary, using regex. Using this, I was able to derive 11 types of events, as well as the main player and secondary player involved in those events and many other statistics. In case I've missed extracting some useful information, you are gladly invited to do so and share your findings. The dataset provides a granular view of 9,074 games, totaling 941,009 events from the biggest 5 European football (soccer) leagues: England, Spain, Germany, Italy, France from 2011/2012 season to 2016/2017 season as of 25.01.2017. There are games that have been played during these seasons for which I could not collect detailed data. Overall, over 90% of the played games during these seasons have event data. The dataset is organized in 3 files: - **events.csv** contains event data about each game. Text commentary was scraped from: bbc.com, espn.com and onefootball.com - **ginf.csv** contains metadata and market odds about each game. Odds were collected from oddsportal.com - **dictionary.txt** contains a dictionary with the textual description of each categorical variable coded with integers Past Research ------------- I have used this data to: - create predictive models for football games in order to bet on football outcomes. - make visualizations about upcoming games - build expected goals models and compare players Inspiration ----------- There are tons of interesting questions a sports enthusiast can answer with this dataset. For example: - What is the value of a shot? Or what is the probability of a shot being a goal given its location, shooter, league, assist method, gamestate, number of players on the pitch, time - known as expected goals (xG) models - When are teams more likely to score? - Which teams are the best or sloppiest at holding the lead? - Which teams or players make the best use of set pieces? - In which leagues is the referee more likely to give a card? - How do players compare when they shoot with their weak foot versus strong foot? Or which players are ambidextrous? - Identify different styles of plays (shooting from long range vs shooting from the box, crossing the ball vs passing the ball, use of headers) - Which teams have a bias for attacking on a particular flank? And many many more...
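The point about shot context can be made concrete with a very rough stand-in for an expected goals estimate: the share of shots from each location that ended in a goal. The sketch below is only illustrative. The `(location, is_goal)` record layout and the function name are hypothetical and do not describe the actual columns of events.csv, and a real xG model would condition on more variables than location alone.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def conversion_by_location(shots: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return goals divided by shots for each shot location."""
    totals = defaultdict(int)
    goals = defaultdict(int)
    for location, is_goal in shots:
        totals[location] += 1
        goals[location] += int(is_goal)
    return {location: goals[location] / totals[location] for location in totals}


# Hypothetical records: two teams with the same shot count but different context.
example = [("box", True), ("box", False), ("long range", False), ("long range", False)]
rates = conversion_by_location(example)
```

Summing these per-location rates over a team's shots gives a context-aware alternative to a raw shot count.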
Context This is a fictitious dataset created for the Data Analytics Bootcamp at ILTACON 2018. Content In this dataset, each legal case is a row. We have generated fake data for each case attribute. Acknowledgements Thanks to @Ventrisfox for generating the fake data.
Data Set The labelled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary, meaning the IMDB rating < 5 results in a sentiment score of 0, and rating >=7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000 review labelled training set does not include any of the same movies as the 25,000 review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels. File descriptions - **labeledTrainData -** The labelled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review. - **testData -** The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one. - **unlabeledTrainData -** An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review. - **sampleSubmission -** A comma-delimited sample submission file in the correct format. Data fields - **id -** Unique ID of each review - **sentiment -** Sentiment of the review; 1 for positive reviews and 0 for negative reviews - **review -** Text of the review
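The labelling rule and the file layout described above can be sketched as follows. This is a minimal illustration under stated assumptions: both function names are hypothetical, the column names are taken from the "Data fields" list, and default CSV quoting is assumed because the description does not say how quotes inside reviews are handled.

```python
import csv
from typing import Iterator, Optional, TextIO, Tuple


def sentiment_from_rating(rating: int) -> Optional[int]:
    """Map an IMDB rating to the binary sentiment label."""
    if rating < 5:
        return 0      # negative review
    if rating >= 7:
        return 1      # positive review
    return None       # ratings 5 and 6 fall outside the stated rule


def read_labeled_reviews(handle: TextIO) -> Iterator[Tuple[str, int, str]]:
    """Yield (id, sentiment, review) from a tab-delimited file with a header row."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield row["id"], int(row["sentiment"]), row["review"]
```

Returning `None` for the middle ratings makes explicit that the rule as written leaves them unlabelled.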
The dataset and the corresponding code (Matlab) can be used for recoloring images, thus helping the people with Color Vision Deficiencies (CVD) recognize and communicate color information. Please cite the following paper if you wish to use our dataset and code: Yulun Wang, Duo li, Menghan Hu, Liming Cai, Guangtao Zhai. Non-local Recoloring Algorithm for Color Vision Deficiencies with Naturalness and Detail Preserving (Unpublished paper, will be updated later). If you have any questions, you can send a request to: humenghan89@163.com
Emotion expression is an essential part of human interaction. The same text can hold different meanings when expressed with different emotions. Thus understanding the text alone is not enough for getting the meaning of an utterance. Acted and natural corpora have been used to detect emotions from speech. Many speech databases for different languages including English, German, Chinese, Japanese, Russian, Italian, Swedish and Spanish exist for modeling emotion recognition. Since there is no reported reference of an available Arabic corpus, we decided to collect the first Arabic Natural Audio Dataset (ANAD) to recognize discrete emotions. Embedding an effective emotion detection feature in speech recognition system seems a promising solution for decreasing the obstacles faced by the deaf when communicating with the outside world. There exist several applications that allow the deaf to make and receive phone calls normally, as the hearing-impaired individual can type a message and the person on the other side hears the words spoken, and as they speak, the words are received as text by the deaf individual. However, missing the emotion part still makes these systems not hundred percent reliable. Having an effective speech to text and text to speech system installed in their everyday life starting from a very young age will hopefully replace the human ear. Such systems will aid deaf people to enroll in normal schools at very young age and will help them to adapt better in classrooms and with their classmates. It will help them experience a normal childhood and hence grow up to be able to integrate within the society without external help. Eight videos of live calls between an anchor and a human outside the studio were downloaded from online Arabic talk shows. Each video was then divided into turns: callers and receivers. To label each video, 18 listeners were asked to listen to each video and select whether they perceive a happy, angry or surprised emotion. 
Silence, laughs and noisy chunks were removed. Every chunk was then automatically divided into 1 sec speech units forming our final corpus composed of 1384 records. Twenty-five acoustic features, also known as low-level descriptors (LLDs), were extracted. These features are: intensity, zero crossing rates, MFCC 1-12 (Mel-frequency cepstral coefficients), F0 (fundamental frequency) and F0 envelope, probability of voicing, and LSP frequency 0-7. Nineteen statistical functions were applied to every feature. The functions are: maximum, minimum, range, absolute position of maximum, absolute position of minimum, arithmetic mean, Linear Regression1, Linear Regression2, Linear RegressionA, Linear RegressionQ, standard deviation, kurtosis, skewness, quartiles 1, 2, 3, and inter-quartile ranges 1-2, 2-3, 1-3. The delta coefficient for every LLD is also computed as an estimate of the first derivative, leading to a total of 950 features. I would never have reached this far without the help of my supervisors. I warmly thank and appreciate Dr. Rached Zantout, Dr. Lama Hamandi, and Dr. Ziad Osman for their guidance, support and constant supervision.
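The total of 950 features follows from the counts given above, and the arithmetic can be checked with a short sketch. The per-descriptor counts are inferred from the index ranges in the text (for example, MFCC 1-12 contributing twelve coefficients and zero crossing rate counted once); they are an assumption, not something the dataset states explicitly.

```python
# Low-level descriptors (LLDs) and how many contours each contributes.
# Counts are inferred from the index ranges in the description (assumption).
LLD_COUNTS = {
    "intensity": 1,
    "zero crossing rate": 1,
    "MFCC 1-12": 12,
    "F0": 1,
    "F0 envelope": 1,
    "probability of voicing": 1,
    "LSP frequency 0-7": 8,
}
N_FUNCTIONALS = 19  # statistical functions applied to every contour

n_lld = sum(LLD_COUNTS.values())         # twenty-five LLDs
n_contours = 2 * n_lld                   # each LLD plus its delta coefficient
n_features = n_contours * N_FUNCTIONALS  # functionals applied to every contour
```

In other words, (25 LLDs + 25 deltas) × 19 functions gives the 950 features reported.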
Presence-absence dataset of small mammals in 164 counties of the Hengduan Mountains
Model parameters and derived summary statistics for simulated dataset
The original form of this dataset is available at [http://qwone.com/~jason/20Newsgroups/][1]. The 20 Newsgroups data set is a collection of approximately 19K newsgroup documents. This is the third version, which has 18,828 documents. All files have been converted to txt format. [1]: http://qwone.com/~jason/20Newsgroups/
Summary of CIBERSORT algorithm analysis with all datasets and PCD samples. Cell types are in column and samples in row. Values are ratios over 1.
Reconstructed slices of a 3D tomographic dataset without ring artifact suppression.
The datasets contain records of deceased people, collected from ssdmf.info.
Dataset used to evaluate sample size to estimate genomic kinship.
My dataset contains 10,000 images of Indian vehicle license plates.
A, exon 2 diversity compared to the rest of the coding region (fused exons 1, 3, 4, 5 and 6) in the dataset including the entire coding region (49 sequences). Mean ± sem is shown. B, synonymous and non-synonymous diversity in the coding region in the dataset including the entire coding region. In the short exon 5 (24 bp) half of the alleles have G instead of C at the nucleotide position 22, resulting in high apparent diversity for the whole exon. Mean ± sem is shown. C, sliding window analysis of non-synonymous, synonymous and complex substitutions in the -e2 in the dataset including the complete -e2. Complex stands for complex combinations of non-synonymous and synonymous substitutions in the same codon. The graph illustrates the contribution of these different components in , which is not equal to and (, as calculated here does not take into consideration the capability of the codon to mutate in synonymous and non-synonymous manner). Copyright information: Taken from "Sequence features of locus define putative basis for gene conversion and point mutations" http://www.biomedcentral.com/1471-2164/9/228 BMC Genomics 2008;9():228-228. Published online 19 May 2008. PMCID: PMC2408603.
Data matrices and results from all RAxML analyses of concatenated datasets described in the study
**SOURCE** Data taken from : **Boehringer Ingelheim** Predict a biological response of molecules from their chemical properties https://www.kaggle.com/c/bioresponse/data
Context There's a story behind every dataset and here's your opportunity to share yours. Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?
Catchment areas of Lake Baikal catchment, projected in UTM Z48, WGS 84. The data were taken from Swiercz, S. (2004). GIS supported characterization of the Baikal region. Diploma Thesis, Free University of Berlin, Germany. The dataset has two components: * Lake Baikal catchment area * Catchment areas of the main tributary rivers to Lake Baikal (Selenga, Barguzin, Upper Angara)
This data repository contains (1) a yearly global autotrophic respiration (RA) dataset from 1980 to 2012 with a spatial resolution of 0.5°; (2) the original field observations used to develop the Random Forest (RF) model; (3) the main R codes used to produce the RA database. Model description: The globally gridded RA database was developed by Random Forest (RF) with 449 field observations (see “dataset.csv” in this repository, updated from Bond-Lamberty and Thomson, 2018) using 11 global variables, including gridded temperature, precipitation, diurnal temperature range, potential evapotranspiration, Palmer Drought Severity Index, nitrogen deposition, downward shortwave radiation, soil carbon content, soil nitrogen density, soil water content, and land cover. Dataset information: Dataset name: “Respiration_autotrophic_belowgroud_glob_1980_2012_yr_half_dgree.nc”, which means global belowground autotrophic respiration from 1980 to 2012 with a spatial resolution of 0.5° at a yearly step. Units: g C m⁻² yr⁻¹. Format: network Common Data Form (netCDF). Spatial coverage: 90S-90N, 180W-180E. The “dataset.csv” file contains the field observations from peer-reviewed publications, combined with the Global Soil Respiration Database (SRDB v4, Bond-Lamberty and Thomson, 2018), which is publicly available at https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1578. In addition, the database was further updated using observations collected from the China Knowledge Resource Integrated Database (www.cnki.net) up to November 2018 according to the criteria of SRDB. This dataset is provided in “.csv” format. R codes: 10fold_CV_RA.txt: 10-fold CV for RA; Annual_variability_RA.txt: annual variability for global RA; CMP_RA.txt: comparing RF-RA and Hashimoto2015-RA using the CMP approach; Ra_DD_CC_plot.txt: plotting the comparison results from CMP; RA_MAT_MAP_anomaly.txt: plotting and modelling the relationship between temperature/precipitation anomalies and RA; RGB_plot.txt: deriving an RGB plot to detect the relative importance of temperature, precipitation and shortwave radiation.
This dataset provides the results of warming incubation of Arctic soils from trough areas of a high-center polygon at the Barrow Environmental Observatory (BEO) in northern Alaska, United States. The organic-rich soil (8-20 cm below ground surface) and the mineral-rich soil (22-45 cm below surface) were separated, and the thawed and homogenized subsamples from each soil were incubated at -2 degrees C or 8 degrees C for 122 days under anoxic conditions (headspace filled with N2). The DOM extracted from the soil samples was analyzed by Fourier transform ion cyclotron resonance mass spectrometry coupled with electrospray ionization (ESI-FTICR-MS). Reported analytes include soil water content, dissolved organic carbon, total organic carbon, MS peaks' m/z and intensities, and elemental composition of identified molecular formulas.
Genotype information for each sample is displayed on a single line. Labels for diploid loci are organized as column headers along the top. Dataset includes 369 individuals collected from 10 geographic locations. This file is formatted for input in GenAlEx v.6.502
Context I have developed an online judge for my university named [RUET OJ][1]. I am sharing the server log dataset of RUET OJ. Content This dataset has 16008 rows and 4 columns. The columns are IP, Time, URL, and Response Status. Acknowledgements This dataset is too small for research, but I hope other people will also share larger web log datasets, as they are rare here. Inspiration I hope this dataset will inspire other people to share the web log datasets they have collected. [1]: http://ruetoj.ml
Excel file containing the full dataset of the paper "Sediment Respiration Pulses in Intermittent Rivers and Ephemeral Streams". The first sheet contains a description of the variables. The second sheet contains the data. These data were used together with the R code (Code S1 file) to generate the results presented in the paper.
Context This dataset contains the list of jobs recommended to an individual. Content List of suggested job recommendations. Acknowledgements Thank you to LinkedIn and IFFFT for helping to collect the dataset. Inspiration I wish to design an advanced version of a recommendation engine.
About This Dataset ===== You can use these font files to generate Chinese characters. The resulting images can be used to train a machine learning model to recognize text. Dataset is updating ===== Tell me if you have other font files or anything related to this topic.
Copy from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en to use on Kaggle Kernel
Movie revenue depends on multiple factors such as cast, budget, film critic review, MPAA rating, release year, etc. Because of these multiple factors there is no analytical formula for predicting how much revenue a movie will generate. However by analyzing revenues generated by previous movies, one can build a model which can help us predict the expected revenue for a movie. Such a prediction could be very useful for the movie studio which will be producing the movie so they can decide on expenses like artist compensations, advertising, promotions, etc. accordingly. Plus investors can predict an expected return-on-investment.
These data were derived from the RAW data as follows: the PM10, PM2.5 and PM1 mass concentrations were measured at 5 min intervals in Kunming using an environmental dust monitor (Grimm EDM 180-MC, GRIMM Aerosol Technik GmbH & Co. KG). The converted datasets were combined, averaged over 1 hour, and then saved to this file.
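The hourly averaging step can be sketched as below. This is a minimal illustration, assuming readings are grouped by clock hour and combined with a plain arithmetic mean; the description does not state how the hour windows were aligned, and the function name `hourly_means` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, Tuple


def hourly_means(samples: Iterable[Tuple[datetime, float]]) -> Dict[datetime, float]:
    """Average 5-minute readings into one value per clock hour."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Truncate the timestamp to the start of its hour (assumed alignment).
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}
```

The same function would be applied separately to the PM10, PM2.5 and PM1 series.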
Dataset II used for the analysis of primer universality
This dataset contains key characteristics about the data described in the Data Descriptor Time series of heat demand and heat pump efficiency for energy system modeling. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format
Input data for case studies.
Context Mass Shootings in the United States of America (1966-2017) The US has witnessed 398 mass shootings in last 50 years that resulted in 1,996 deaths and 2,488 injured. The latest and the worst mass shooting of October 2, 2017 killed 58 and injured 515 so far. The number of people injured in this attack is more than the number of people injured in all mass shootings of 2015 and 2016 combined. The average number of mass shootings per year is 7 for the last 50 years that would claim 39 lives and 48 injured per year. Content Geography: United States of America Time period: 1966-2017 Unit of analysis: Mass Shooting Attack Dataset: The dataset contains detailed information of 398 mass shootings in the United States of America that killed 1996 and injured 2488 people. Variables: The dataset contains Serial No, Title, Location, Date, Summary, Fatalities, Injured, Total Victims, Mental Health Issue, Race, Gender, and Lat-Long information. Acknowledgements I’ve consulted several public datasets and web pages to compile this data. Some of the major data sources include [Wikipedia][1], [Mother Jones][2], [Stanford][3], [USA Today][4] and other web sources. Inspiration With a broken heart, I like to call the attention of my fellow Kagglers to use Machine Learning and Data Sciences to help me explore these ideas: • How many people got killed and injured per year? • Visualize mass shootings on the U.S map • Is there any correlation between shooter and his/her race, gender • Any correlation with calendar dates? Do we have more deadly days, weeks or months on average • What cities and states are more prone to such attacks • Can you find and combine any other external datasets to enrich the analysis, for example, gun ownership by state • Any other pattern you see that can help in prediction, crowd safety or in-depth analysis of the event • How many shooters have some kind of mental health problem? 
Can we compare that shooter with general population with same condition Mass Shootings Dataset Ver 3 This is the new Version of Mass Shootings Dataset. I've added eight new variables: 1. Incident Area (where the incident took place), 2. Open/Close Location (Inside a building or open space) 3. Target (possible target audience or company), 4. Cause (Terrorism, Hate Crime, Fun (for no obvious reason etc.) 5. Policeman Killed (how many on duty officers got killed) 6. Age (age of the shooter) 7. Employed (Y/N) 8. Employed at (Employer Name) Age, Employed and Employed at (3 variables) contain shooter details Mass Shootings Dataset Ver 4 Quite a few missing values have been added Mass Shootings Dataset Ver 5 Three more recent mass shootings have been added including the Texas Church shooting of November 5, 2017 I hope it will help create more visualization and extract patterns. Keep Coding! [1]: https://en.wikipedia.org/wiki/Category:Mass_shootings_in_the_United_States_by_year [2]: http://www.motherjones.com/politics/2012/12/mass-shootings-mother-jones-full-data/ [3]: https://library.stanford.edu/projects/mass-shootings-america [4]: http://www.gannett-cdn.com/GDContent/mass-killings/index.htmltitle
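The first question above (killed and injured per year) can be approached with a short aggregation. The sketch below is only an illustration: the column names `Date`, `Fatalities` and `Injured` come from the variable list above, but the date format and the sample rows are assumptions (the rows are placeholders, not real incidents).

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def victims_per_year(csv_text, date_format="%m/%d/%Y"):
    """Sum the Fatalities and Injured columns for each calendar year.

    `date_format` is an assumption; adjust it to match the real file.
    """
    totals = defaultdict(lambda: {"Fatalities": 0, "Injured": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = datetime.strptime(row["Date"], date_format).year
        totals[year]["Fatalities"] += int(row["Fatalities"])
        totals[year]["Injured"] += int(row["Injured"])
    return dict(sorted(totals.items()))


# Placeholder rows for illustration only; these are not taken from the dataset.
sample_csv = """Date,Fatalities,Injured
1/5/2016,3,2
6/12/2016,4,0
2/1/2017,5,7
"""
per_year = victims_per_year(sample_csv)
```

With the real file, the text passed to `victims_per_year` would come from reading the CSV from disk; missing values in the count columns would need handling before the `int` conversion.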
This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of [vgchartz.com][1]. Fields include * Rank - Ranking of overall sales * Name - The game's name * Platform - Platform of the game's release (e.g. PC, PS4, etc.) * Year - Year of the game's release * Genre - Genre of the game * Publisher - Publisher of the game * NA_Sales - Sales in North America (in millions) * EU_Sales - Sales in Europe (in millions) * JP_Sales - Sales in Japan (in millions) * Other_Sales - Sales in the rest of the world (in millions) * Global_Sales - Total worldwide sales. The script to scrape the data is available at https://github.com/GregorUT/vgchartzScrape. It is based on BeautifulSoup using Python. There are 16,598 records. 2 records were dropped due to incomplete information. [1]: http://www.vgchartz.com/
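As a starting point for exploring these fields, the sketch below totals `Global_Sales` per `Genre`. The column names follow the field list above; the sample rows, game names and publishers are hypothetical placeholders, and the function name `sales_by_genre` is an assumption for the example.

```python
import csv
import io
from collections import defaultdict


def sales_by_genre(csv_text):
    """Return (genre, total global sales) pairs, largest total first."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Genre"]] += float(row["Global_Sales"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows for illustration; not taken from the actual dataset.
sample_csv = """Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Game A,PC,2010,Sports,Pub X,1.0,0.5,0.25,0.25,2.0
2,Game B,PS4,2015,Action,Pub Y,0.5,0.25,0.0,0.25,1.0
3,Game C,PC,2012,Sports,Pub X,0.5,0.0,0.0,0.0,0.5
"""
ranking = sales_by_genre(sample_csv)
```

The same pattern works for `Platform` or `Publisher` by changing the grouping key.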
This dataset consists of 5547 breast histology images of size 50 x 50 x 3, curated from [Andrew Janowczyk website][1] and used for a data science tutorial at [Epidemium][2]. The goal is to classify cancerous images (IDC : invasive ductal carcinoma) vs non-IDC images. [1]: http://www.andrewjanowczyk.com/use-case-6-invasive-ductal-carcinoma-idc-segmentation/ [2]: http://www.epidemium.cc/
Any user who accepts the BSRN data release guidelines (http://bsrn.awi.de/data/conditions-of-data-release) may ask Amelie Driemel (Amelie.Driemel@awi.de) to obtain an account to download these datasets. Newer data are available at: https://dataportals.pangaea.de/bsrn/ NOTE TO USERS: The best way to view the data is by clicking on "View dataset as HTML"; you will then be able to click on the year and station of interest to download the data files. The download format is a .tab file, which can be opened with any program that opens txt files. However, if you want to bulk change the file extension there are various ways to do so, e.g. within the Windows command window with the command: ren *.tab *.txt
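For users who are not on Windows, the same bulk rename can be done with a few lines of Python. This is a hedged sketch rather than an official BSRN tool; the function name `rename_tab_to_txt` and the choice to skip files whose `.txt` counterpart already exists are assumptions.

```python
from pathlib import Path


def rename_tab_to_txt(directory):
    """Rename every *.tab file in `directory` to *.txt.

    Files whose .txt target already exists are skipped so nothing is
    overwritten. Returns the list of new paths.
    """
    renamed = []
    for tab_file in sorted(Path(directory).glob("*.tab")):
        target = tab_file.with_suffix(".txt")
        if target.exists():
            continue  # leave the .tab file alone rather than overwrite
        tab_file.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_tab_to_txt(".")` in the download folder mirrors the `ren *.tab *.txt` command, with the difference that existing `.txt` files are never replaced.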
This is the dataset related to the article published in Sensors and Actuators B titled "Automatic de-noising of close-range hyperspectral images with a wavelength-specific shearlet-based image noise reduction method". The dataset comprises hyperspectral images of six different tea products acquired in the visible and near-infrared spectral range. Please find the complete description in the attached file "Description of the database.docx".
Here we provide two ArcGIS map packages with georeferenced files on the spatial distribution of sponges and echinoderms in the wider Weddell Sea (Antarctica), which were created in the context of the development of a marine protected area (MPA) in the Weddell Sea. Sponges: The map of interpolated occurrence of sponges is based on quantitative abundance data (Gerdes 2014 a - o) and on semi-quantitative data obtained by W. Arntz (retired; formerly AWI) (see Teschke & Brey 2019a for presence / absence records of the latter dataset). The abundance data were classified to be merged with the semi-quantitative data and an inverse distance weighted method was performed on the united dataset. Areas with very common occurrence of sponges occurred on the shelf near Brunt Ice Shelf along Riiser-Larsen Ice Shelf to Ekstrøm Ice Shelf. Echinoderms: A cluster analysis with species x station datasets of asteroids (Teschke & Brey 2019b), ophiuroids (Teschke & Brey 2019c) and holothurians (Gutt et al. 2014) from the Antarctic Weddell Sea indicated a particular cold-water echinoderm fauna on the Filchner shelf. We approximated this potential habitat by bottom temperature ≤ -1°, based on seawater temperature data from the Finite Element Sea Ice - Ocean Model provided by R. Timmermann (AWI). More information on the spatial analysis is given in working paper WG-EMM-16/03 submitted to the CCAMLR Working Group on Ecosystem Monitoring and Management (available at https://www.ccamlr.org/en/wg-emm-16).
This repository provides the supplementary R code and data to reproduce the experiments in the following paper : "Highly accurate autonomic diagnosis of papillary thyroid carcinomas using a pathway-based personalized machine learning algorithms". These include: 1. The main method function R file 2. The main script R file 3. The datasets for the development/validation cohorts (R data file format) 4. The pathway information (R data file format)
Context The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. This dataset became a typical test case for many statistical classification techniques in machine learning such as support vector machines. Content The dataset contains a set of 150 records under 5 attributes - Petal Length, Petal Width, Sepal Length, Sepal Width and Class (Species). Acknowledgements This dataset is free and is publicly available at the UCI Machine Learning Repository.
This dataset contains files in OPJ format, which can be opened with the data processing software ORIGIN, and files in WFM format, which are created by a Tektronix oscilloscope and can be opened with WAVESTAR FOR OSCILLOSCOPES. The data are for the paper "Broadband Amplification of Low-Terahertz Signals Using Axis-Encircling Electrons in a Helically Corrugated Interaction Region", published in Physical Review Letters.
Context Interactive Hand Gesture. Here are color images as well as depth images of hand gestures, grouped by their classes. Copyright: Author: Chengyin Liu; Email: destin369y at gmail.com; Year: 2015. Please acknowledge my name if you use this dataset, thank you. Content RGB-D hand gesture images taken by a depth camera, grouped by classes. Please refer to "class.txt". Used for hand gesture recognition evaluation. Acknowledgements This dataset is used for my hand gesture recognition research at National Taiwan University, in the Intelligent Robot Lab under the lead of Prof. Li-Chen Fu. For more details about our lab, please visit http://robotlab.csie.ntu.edu.tw/
I used scrapy to gather data on charities in the United States. All data has been retrieved from https://www.charitynavigator.org/
Context The World Marathon Majors consist of six major city marathons (https://en.wikipedia.org/wiki/World_Marathon_Majors). Lists of all historic winners can be found via their individual Wikipedia pages: - Tokyo (https://en.wikipedia.org/wiki/Tokyo_Marathon) - Boston (https://en.wikipedia.org/wiki/List_of_winners_of_the_Boston_Marathon) - London (https://en.wikipedia.org/wiki/List_of_winners_of_the_London_Marathon) - Berlin (https://en.wikipedia.org/wiki/Berlin_Marathon) - Chicago (https://en.wikipedia.org/wiki/List_of_winners_of_the_Chicago_Marathon) - New York (https://en.wikipedia.org/wiki/List_of_winners_of_the_New_York_City_Marathon) Content Using Wikipedia and Pandas, I've scraped and compiled a list of male and female winners for each race and processed it into easy-to-use csv files. Code can be found on GitHub [here][1]. Acknowledgements All data was scraped from Wikipedia, so thanks to all contributors. [1]: https://github.com/GJBroughton/World_Marathon_Majors
Context A dataset for 1300 laptop models. Content 1. Company Name 2. Product Name 3. Laptop Type 4. Screen Inches 5. Screen Resolution 6. CPU Model 7. RAM Characteristics 8. Memory 9. GPU Characteristics 10. Operating System 11. Laptop's Weight 12. Laptop's Price
Sokoto Coventry Fingerprint Dataset (SOCOFing) is a biometric fingerprint database designed for academic research purposes. SOCOFing is made up of 6,000 fingerprint images from 600 African subjects and contains unique attributes such as labels for gender, hand and finger name as well as synthetically altered versions with three different levels of alteration for obliteration, central rotation, and z-cut. For a complete formal description and usage policy please refer to the following paper: https://arxiv.org/abs/1807.10609
The original dataset. A cohort of 149 individuals (58 healthy children, 91 pediatric IBD patients). This sample set was processed and analyzed in Amsterdam, the Netherlands, using an ABI PRISM 3130 Genetic Analyzer. The RData object has two attributes: data - holds the count data of OTU abundances. Rows are samples and columns are OTUs; labels - holds a corresponding label for each sample in data.
Use of gamification strategy in Accounting Education in Brazil to improve students' skills. Dataset in Portuguese.
A vast number of clinical disorders may involve changes in brain structure that are correlated with cognitive function and behavior (e.g., depression, schizophrenia, stroke, etc.). Reliably understanding the relationship between specific brain structures and relevant behaviors in worldwide clinical populations could dramatically improve healthcare decisions around the world. For instance, if a reliable relationship between brain structure after stroke and functional motor ability was established, brain imaging could be used to predict prognosis/recovery potential for individual patients. However, high heterogeneity in clinical populations in both individual neuroanatomy and behavioral outcomes makes it difficult to develop accurate models of these potentially subtle relationships. Large neuroimaging studies (n>10,000) would provide unprecedented power to successfully relate clinical neuroanatomy changes with behavioral measures. While these sample sizes might be difficult for any one individual to collect, the ENIGMA Center for Worldwide Medicine, Imaging, and Genomics has successfully pioneered meta- and mega-analytic methods to accomplish this task. ENIGMA brings together a global alliance of over 500 international researchers from over 35 countries to pool together neuroimaging data on different disease states in hopes of discovering critical brain-behavior relationships. Individual investigators with relevant data run ENIGMA analysis protocols on their own data and send back an output folder containing the analysis results to be combined with data from other sites for a meta-analysis. In this way, large sample sizes can be acquired without the hassle of large-scale data transfers or actual neuroimaging data sharing. A test dataset is available on request; if interested, please email npnl@usc.edu.
Weather data monitoring has been ongoing since late 2013 in a network of three sites located in the Campi Flegrei volcanic area, near Naples (Italy), in the framework of the MONICA (Innovative Monitoring of Coastal and Marine Environment) Project. The aim of this activity is to acquire time series to analyze the influence of meteorological factors on geomorphological coastal processes, such as cliff retreat, landslides and beach erosion. The uploaded dataset includes data (temperature, rain, wind, barometric pressure and relative humidity) acquired at the Denza automatic weather station (model DAVIS Vantage Pro2 wireless) during the period Jan. 2014 - Dec. 2018. Automatic data transfer from the weather station to the ISMAR-CNR processing center of Naples is performed by an internet LAN connection.
Comparisons of this dataset to other publicly available datasets. Included in this file are: 1- SGP-biased genes that are also annotated as expressed in SGPs on http://wormbase.org , 2- hmc-biased genes that are also annotated as expressed in hmcs on wormbase.org , 3- SGP enriched genes [18] that are also detected in our study, 4- C. elegans transcription factors in the wTF2.0 dataset [54] that are SGP-biased in our dataset, and 5- expression results for C. elegans homologs of pluripotency factors. (XLSX 152 kb)
Cloud computing is an emerging technology. It processes huge amounts of data, so the scheduling mechanism plays a vital role in cloud computing. My protocol is designed to minimize switching time and to improve resource utilization, server performance and throughput. The protocol schedules jobs in the cloud and addresses drawbacks of existing protocols. Each job is assigned a priority, which gives better performance, and the protocol aims to minimize waiting time and switching time. Every effort has been made to manage job scheduling so as to overcome the drawbacks of existing protocols and to improve the efficiency and throughput of the server.
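The description above does not spell out the protocol itself, so the following is only a generic sketch of priority-based job scheduling, not the author's method. It assumes all jobs arrive at time 0, run without preemption, and that a lower priority number means "run earlier"; the job names and durations are hypothetical.

```python
import heapq


def schedule_by_priority(jobs):
    """Run jobs in priority order and record how long each one waits.

    `jobs` is a list of (name, priority, burst_time) tuples. A lower
    priority number runs first; ties keep their submission order.
    Returns (execution order, waiting time per job).
    """
    heap = [
        (priority, index, name, burst)
        for index, (name, priority, burst) in enumerate(jobs)
    ]
    heapq.heapify(heap)
    clock = 0
    order = []
    waiting = {}
    while heap:
        _, _, name, burst = heapq.heappop(heap)
        waiting[name] = clock  # time spent queued before starting
        order.append(name)
        clock += burst
    return order, waiting


# Hypothetical jobs: (name, priority, burst time in arbitrary units).
jobs = [("backup", 3, 5), ("query", 1, 2), ("report", 2, 4)]
order, waiting = schedule_by_priority(jobs)
average_wait = sum(waiting.values()) / len(waiting)
```

A binary heap keeps each scheduling decision at logarithmic cost in the number of queued jobs, which is one common way to keep scheduler overhead low.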
MNIST data from http://neuralnetworksanddeeplearning.com
This mango transcriptome assembly was derived from pooled leaf, stem, bud, root, floral and fruit tissue. Using normalized cDNA libraries, we generated comprehensive RNA-Seq datasets using the Illumina NextSeq 500 platform. A total of 82,198 mango unigenes were generated and functionally annotated using a combination of <i>de novo</i> transcriptome assembly, redundancy reduction and Basic Local Alignment Search Tool (BLAST) searches against the Universal Protein Resource UniProtKB/Swiss-Prot database.
Natural Speech Dataset
This dataset provides leaf trait, proximate composition, fatty acid profile, phenolic composition, and <i>in vitro</i> true digestibility of <i>Acer pseudoplatanus</i>, <i>Fraxinus excelsior</i>, <i>Salix caprea</i>, and <i>Sorbus aucuparia</i> foliage, from data collected in Trivero (Italy) in 2015.
The goal of this project is to improve accessibility of open datasets by curating them. “NiData” aims to provide a common interface for documentation, downloads, and examples to all open neuroimaging datasets, making data usable for experts and non-experts alike. NiData is a Python package that provides a single interface accessing data from a variety of open data sources. The software framework makes it easy to add new data sources, simple to define and to provide access to multiple datasets from a single data source. Software dependencies are managed on a per-dataset basis, allowing downloads and examples to use any public packages without requiring installation of packages required by unused datasets. The interface also allows selective download of data (by subject or type) and caches files locally, allowing easy management of big datasets. We focused on exposing new methods for downloading data from the HCP, supporting access via Amazon S3 and HTTP/XNAT. We were able to provide a downloader that accepts login credentials and downloads files locally. We created an example that interacts with DIPY to produce diffusion imaging results on a single subject from the HCP. We also worked at collecting common data sources, as well as individual datasets stored at each data source, into NiData’s “data sources” wiki page. We incorporated downloads, documentation, and examples from the nilearn package and began discussion of making a more extensible object model.
Voice Gender ---------------- Gender Recognition by Voice and Speech Analysis This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0hz-280hz ([human vocal range][2]). The Dataset The following acoustic properties of each voice are measured and included within the CSV: - **meanfreq**: mean frequency (in kHz) - **sd**: standard deviation of frequency - **median**: median frequency (in kHz) - **Q25**: first quantile (in kHz) - **Q75**: third quantile (in kHz) - **IQR**: interquantile range (in kHz) - **skew**: skewness (see note in specprop description) - **kurt**: kurtosis (see note in specprop description) - **sp.ent**: spectral entropy - **sfm**: spectral flatness - **mode**: mode frequency - **centroid**: frequency centroid (see specprop) - **peakf**: peak frequency (frequency with highest energy) - **meanfun**: average of fundamental frequency measured across acoustic signal - **minfun**: minimum fundamental frequency measured across acoustic signal - **maxfun**: maximum fundamental frequency measured across acoustic signal - **meandom**: average of dominant frequency measured across acoustic signal - **mindom**: minimum of dominant frequency measured across acoustic signal - **maxdom**: maximum of dominant frequency measured across acoustic signal - **dfrange**: range of dominant frequency measured across acoustic signal - **modindx**: modulation index. 
Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range - **label**: male or female Accuracy Baseline (always predict male) 50% / 50% Logistic Regression 97% / 98% CART 96% / 97% Random Forest 100% / 98% SVM 100% / 99% XGBoost 100% / 99% Research Questions An original analysis of the data-set can be found in the following article: [Identifying the Gender of a Voice using Machine Learning][3] The best model achieves 99% accuracy on the test set. According to a CART model, it appears that looking at the mean fundamental frequency might be enough to accurately classify a voice. However, some male voices use a higher frequency, even though their resonance differs from female voices, and may be incorrectly classified as female. To the human ear, there is apparently more than simple frequency, that determines a voice's gender. Questions - What other features differ between male and female voices? - Can we find a difference in resonance between male and female voices? - Can we identify falsetto from regular voices? (separate data-set likely needed for this) - Are there other interesting features in the data? CART Diagram ![CART model][4] Mean fundamental frequency appears to be an indicator of voice gender, with a threshold of 140hz separating male from female classifications. 
References [The Harvard-Haskins Database of Regularly-Timed Speech](http://www.nsi.edu/~ani/download.html) [Telecommunications & Signal Processing Laboratory (TSP) Speech Database at McGill University](http://www-mmsp.ece.mcgill.ca/Documents../Downloads/TSPspeech/TSPspeech.pdf), [Home](http://www-mmsp.ece.mcgill.ca/Documents../Data/index.html) [VoxForge Speech Corpus](http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/8kHz_16bit/), [Home](http://www.voxforge.org) [Festvox CMU_ARCTIC Speech Database at Carnegie Mellon University](http://festvox.org/cmu_arctic/) [1]: http://www.primaryobjects.com/2016/06/22/identifying-the-gender-of-a-voice-using-machine-learning [2]: https://en.wikipedia.org/wiki/Voice_frequencyFundamental_frequency [3]: http://www.primaryobjects.com/2016/06/22/identifying-the-gender-of-a-voice-using-machine-learning/ [4]: http://i.imgur.com/Npr2U7O.png
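The CART observation above (a 140 Hz threshold on mean fundamental frequency) can be tried as a one-rule classifier. The sketch below is an illustration only: it assumes `meanfun` is stored in kHz like the other frequency columns, that voices below the threshold are labelled male, and it uses made-up sample values rather than rows from the dataset.

```python
def classify_by_meanfun(meanfun_khz, threshold_khz=0.140):
    """Label a voice from its mean fundamental frequency alone."""
    return "male" if meanfun_khz < threshold_khz else "female"


def accuracy(samples, threshold_khz=0.140):
    """Fraction of (meanfun_khz, label) pairs the single rule gets right."""
    correct = sum(
        1
        for meanfun, label in samples
        if classify_by_meanfun(meanfun, threshold_khz) == label
    )
    return correct / len(samples)


# Made-up values for illustration; the third sample is a higher-pitched
# male voice, which this rule misclassifies as female.
samples = [
    (0.110, "male"),
    (0.125, "male"),
    (0.150, "male"),
    (0.180, "female"),
    (0.210, "female"),
]
score = accuracy(samples)
```

The misclassified sample illustrates the caveat noted above: a single frequency threshold cannot separate higher-pitched male voices from female voices.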
Context I want to create an app that can generate instrumentals of the songs we listen to daily. Content I have generated wav files of different notes using GarageBand. I will use this data to classify musical notes. Acknowledgements I drew on the Stanford paper on generating sheet music from audio files by Jan Dlabal and Richard Wedeen. Inspiration Can we generate midi files of complicated melodies using only wav files of the song?
This dataset was used to conduct our first analysis, which examined the duration of IBIs and their component phases. For this analysis we used 36 years of data (collected between 1977 and 2012) on reproductive states, demographic events, dominance rank, and rainfall for 160 wild-feeding females. Specifically, we had a total of 490 IBIs for 160 females that fit our analysis criteria (see Methods - Data Analysis), with each female contributing an average of 3 IBIs to the dataset (range: 1-10). Note that female identity and pregnancy identity have been anonymized and that the IDs given are identical across tables.
The DM-Authors dataset contains information about 4,906 researchers in the domain of data management. The dataset was crawled from DBLP in October 2014. For each researcher, demographic attributes (gender, seniority, number of publications and publication rate) and activity attributes (list of venues and keywords that the researcher has contributed to) are provided.
Context This data set is helpful for beginners in R or Python. It is a simple data set for analysis. Content A XXX Training institute offers training programs on various courses in the Mechanical, Computers and Electrical domains. Two types of courses are offered: ATP (Advance Training Programme, job oriented, targeted at Bachelor of Engineering students) and MTP (Modular Training Programme, not job oriented, targeted at those who want to update their skills). Training.csv is the enquirer list generated for the 2016-2017 batch. ATP runs in 3 batches per year (Jan, July and Sep). MTP runs throughout the year.
Content This dataset contains all of Hubway's ridership data and station information up to December 2017. License Hubway's data license agreement can be found here: https://www.thehubway.com/data-license-agreement
Overview -------- The World of Warcraft Avatar History Dataset is a collection of records that detail information about player characters in the game over time. It includes information about their character level, race, class, location, and social guild. The Kaggle version of this dataset includes only the information from 2008 (and the dataset in general only includes information from the 'Horde' faction of players in the game from a single game server). - Full Dataset Source and Information: [http://mmnet.iis.sinica.edu.tw/dl/wowah/][1] - Code used to clean the data: [https://github.com/myles-oneill/WoWAH-parser][2] Ideas for Using the Dataset --------------------------- From the perspective of game system designers, players' behavior is one of the most important factors they must consider when designing game systems. To gain a fundamental understanding of the game play behavior of online gamers, exploring users' game play time provides a good starting point. This is because the concept of game play time is applicable to all genres of games and it enables us to model the system workload as well as the impact of system and network QoS on users' behavior. It can even help us predict players' loyalty to specific games. Open Questions -------------- - Understand user gameplay behavior (game sessions, movement, leveling) - Understand user interactions (guilds) - Predict players unsubscribing from the game based on activity - What are the most popular zones in WoW, what level players tend to inhabit each? Wrath of the Lich King ---------------------- An expansion to World of Warcraft, "Wrath of the Lich King" (Wotlk) was released on November 13, 2008. It introduced new zones for players to go to, a new character class (the death knight), and a new level cap of 80 (up from 70 previously). This event intersects nicely with the dataset and is probably interesting to investigate. 
Map --- This dataset doesn't include a shapefile (if you know of one that exists, let me know!) to show where the zones the dataset talks about are. Here is a list of zones and information from this version of the game, including their recommended levels: http://wowwiki.wikia.com/wiki/Zones_by_level_(original) . **Update (Version 3)**: [dmi3kno][3] has generously put together some supplementary zone information files which have now been included in this dataset. Some notes about the files: *Note that some zone names contain Chinese characters. Unicode names are preserved as a key to the original dataset. This addition makes it possible to understand the properties of the zones a bit better - their relative location to each other, competitive properties, type of gameplay and, hopefully, their contribution to character leveling. Location coordinates contain some redundant (and possibly duplicate) records as they are collected from different sources. Working with uncleaned location coordinate data will allow users to demonstrate their data wrangling skills (both working with strings and spatial data).* [1]: http://mmnet.iis.sinica.edu.tw/dl/wowah/ [2]: https://github.com/myles-oneill/WoWAH-parser [3]: https://www.kaggle.com/dmi3kno
This dataset includes global bias-corrected climate model output data from version 1 of NCAR's Community Earth System Model (CESM1) that participated in phase 5 of the Coupled Model Intercomparison Experiment (CMIP5), which supported the Intergovernmental Panel on Climate Change Fifth Assessment Report (IPCC AR5). The dataset contains all the variables needed for the initial and boundary conditions for simulations with the Weather Research and Forecasting model (WRF) or the Model for Prediction Across Scales (MPAS), provided in the Intermediate File Format specific to WRF and MPAS. The data are interpolated to 26 pressure levels and are provided in files at six hourly intervals. The variables have been bias-corrected using the European Centre for Medium-Range Weather Forecasts (ECMWF) Interim Reanalysis (ERA-Interim) fields for 1981-2005, following the method in Bruyere et al. (2014) [http://dx.doi.org/10.1007/s00382-013-2011-6]. Files are available for a 20th Century simulation (1951-2005) and three concomitant Representative Concentration Pathway (RCP) future scenarios (RCP4.5, RCP6.0 and RCP8.5) spanning 2006-2100. NOTE: There are no bias-corrected data for RCP2.6, due to corrupted data caused by a model bug in CESM. Note to Microsoft Windows users: The executable metgrid.exe, which is required to ingest this data into WPS/WRF, is not compatible with Windows and can only be run in a Linux environment. It is recommended, therefore, that this dataset be used in Linux environments only.
These files contain complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip code and state, and collections, among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.
Context The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem. Original Dataset The original dataset can be accessed at [https://archive.ics.uci.edu/ml/datasets/abalone][1]. [1]: https://archive.ics.uci.edu/ml/datasets/abalone
Lately, black carbon (BC) has received significant attention due to its climate-warming properties and adverse health effects. Nevertheless, long-term observations in urban areas are scarce, most likely because BC monitoring is not required by environmental legislation. This, however, handicaps the evaluation of air quality models which can be used to assess the effectiveness of policy measures which aim at reducing BC concentrations.Here, we present a new dataset of atmospheric BC measurements from Germany constructed from over six million measurements at over 170 stations. Data covering the period between 1994 and 2014 were collected from twelve German federal states and the federal Environment Agency (UBA), quality checked and harmonized into a database with comprehensive metadata. The final data in original time resolution are available for download (link will follow). Though assembled in a consistent way, the dataset is characterized by differences in (a) measurement methodologies for determining evolved carbon and optical absorption, (b) covered time periods, and (c) temporal resolutions that ranged from half hourly to 6-daily measurements. Usage of this dataset thus requires a careful consideration of these differences.Our analysis focuses on 2009, the year with the largest data coverage obtained with one single methodology, as well as on the relative changes in long-term trends over ten years. Stations are grouped into the following categories: urban background, traffic, industrial, and rural. For 2009, we find that BC concentrations at traffic sites were at least twice as high as at urban background, industrial and rural sites. Weekly cycles are most prominent at traffic stations, however, the presence of differences in concentrations during the week and on weekends at other station types suggests that traffic plays an important role throughout the full network. 
Generally higher concentrations and weaker weekly cycles during the winter months point towards the influence of other sources such as domestic heating. Regarding the long-term trends, advanced statistical techniques allow us to account for instrumentation changes and to separate seasonal and long-term changes in our dataset. Analysis shows a downward trend in BC at nearly all locations and in all conditions, with a high level of confidence for the period of 2005-2014. In-depth analysis indicates that background BC is decreasing slowly, while the occurrences of high concentrations are decreasing more rapidly. In summary, legislation - both in Europe and locally - to reduce particulate emissions and indirectly BC appears to be working, based on this analysis. Human health and climate impacts are likely to be diminished because of the improvements in air quality.
Context Real estate markets, like those in Sydney and Melbourne, present an interesting opportunity for data analysts to analyze and predict where property prices are heading. Prediction of property prices is becoming increasingly important and beneficial. Property prices are a good indicator of both the overall market condition and the economic health of a country. Considering the data provided, we are wrangling a large set of property sales records stored in an unknown format and with unknown data quality issues.
Actual expenditures for operating funds (General, Special Revenue, Enterprise and Food Services Funds) per student. Student count is enrollment as of October 1.
This is the APS dataset with estimated parameters.
The data explosion driven by high-throughput and high-resolution technologies has led to rapid growth of reference resources and the accumulation of considerably trustworthy knowledge bases. Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. However, as datasets continue to grow, the time involved in this simple task may be prohibitive. In large-scale data mining, the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and to exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm, which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to utilize computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and large numbers of comparisons and rapidly filter out unproductive pairwise comparisons. We present two LINCS applications to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach.
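The filtering idea described above (binarize each signature, then compare with bit operators) can be illustrated with a minimal sketch. This is not the published BSF implementation: the function names, the thresholding rule, and the `min_shared` cut-off are all assumptions made for illustration.

```python
def to_bits(values, threshold):
    """Pack a numeric signature into an int: bit i is set when values[i] >= threshold.
    The thresholding rule is an illustrative assumption."""
    bits = 0
    for i, value in enumerate(values):
        if value >= threshold:
            bits |= 1 << i
    return bits


def shared_count(a, b):
    """Coarse similarity: number of positions set in both bit patterns."""
    return bin(a & b).count("1")


def candidate_pairs(signatures, threshold, min_shared):
    """Return the pairs that survive the cheap bitwise filter.
    Only these pairs would be passed on to a slower, exact similarity measure."""
    encoded = {name: to_bits(vals, threshold) for name, vals in signatures.items()}
    names = sorted(encoded)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if shared_count(encoded[x], encoded[y]) >= min_shared
    ]


# Hypothetical toy signatures, not LINCS data.
signatures = {
    "a": [2.0, 0.1, 3.0, 0.0],
    "b": [2.5, 0.0, 2.2, 0.3],
    "c": [0.0, 0.1, 0.2, 4.0],
}
pairs = candidate_pairs(signatures, threshold=1.0, min_shared=2)
```

In this toy case only the pair `("a", "b")` shares enough set bits to pass the filter; the other two comparisons are discarded after a single AND and a bit count.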
This is the amino acid dataset used to infer the phylogeny presented in Figure 1. It contains 15,549 positions and was assembled from 79 proteins encoded in the chloroplast genomes of 63 green algae. The data set is in Phylip format.
List of the genes composing the merged matrix of all transcriptomic data from datasets and PCD samples (n=9939).
"Identifiers on the rise in Germany" addresses the current development of identifiers in Germany's research landscape. The introduction of ORCID at Germany's universities and research organizations is meeting with great interest and quick uptake. The talk also illustrates the challenges for personal identifiers, especially taking German history into account. Within the talk I will present the latest results from a study on the usage and spread of ORCID in academic institutions in Germany in the course of the ORCID DE project. The presentation will also touch upon the discussion of the recently presented "Kerndatensatz Forschung" (Research Core Dataset) recommended by the German Council of Science and Humanities, which aims to gather coherent information about research activities, also using authority files such as ORCID, DOI and organization identifiers (http://www.forschungsinfo.de/kerndatensatz/en/index.php?home).
The standard biomedical terminologies ICD-10, ICD-O, TNM, MeSH, NCIt, MedDRA, and SNOMED CT were used in a case study where two dimensions of cancer (anatomy and histology) had already been coded in a dataset using a custom terminology (ROCHE).
Stata dataset of 2,118 Lobbying Disclosure Act reports from 574 organizations active on the 2014 Farm Bill. Data originally collected and coded by the Center for Responsive Politics. Includes name of organization, dollar amount of reported expenses, number of lobbyists, lobbyists with previous government experience (revolving door), description of issue, sector and industry of organization, and topic codes (created by author).
Dataset used.
Copyright information: Taken from "Transterm—extended search facilities and improved integration with other databases", Nucleic Acids Research 2005;34(Database issue):D37-D40. Published online 28 Dec 2005. PMCID: PMC1347521. © The Author 2006. Published by Oxford University Press. All rights reserved. Shown is a selection of the type of pre-processed data to view in progress, with the results of a pattern description search from a previous action in the lower frame. The file contents for each type of data have been described previously. These include redundant and non-redundant 3′- and 5′-flanks, CDS, initiation and termination contexts; consensuses and information content of the initiation and termination contexts; codon usage; list of entries making up the dataset; scientific and short names of the species; an overall summary file.
Dataset for genetic relatedness estimates, using TrioML for degree of genetic relatedness between male and female
Content **This dataset contains the following important columns:** 'start' - When the president was elected 'end' - When the term ended 'president' - Who was the president 'prior' - What he was before becoming president 'party' - The supporting party 'vice' - The Vice President during the term
Context I am experimenting with various deep learning architectures on this dataset. This dataset can be used in classification, localization, segmentation, generative models and face detection/recognition/classification. Content The dataset has 52156 RGB images. Train and test datasets are split for each of the 86 classes with a ratio of 0.8. The original images have dimensions of 1988x3056; they were reduced to 288x432 using OpenCV. Acknowledgements I downloaded the books from different webpages. In the future, I can add new images if needed. Inspiration Comic books have different images than the standard images usually worked on. The characters, images, environments, colors, and more in this data set are much more challenging and confusing than the image data sets that have been worked on before. Besides, results from GANs on this data are much bigger and more complicated than those from data sets on which GANs have already been successful.
Content BMS (Building Management System) is a data set that describes complaints (reactives) from people who work in the building. Process: staff in the hospital report to facilities if, for example, an area (room) is too hot; an engineer then goes to check the area and marks the task as completed. I want to show the average time taken each month to complete this task. Under category of work there is "Area Too Hot". I want to select that category of work, show the average time to complete that task, and compare it with other months. Building: a hospital in London (St Barts Hospital).
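The monthly comparison described above could be computed along the following lines. This is only a sketch: the field names `category`, `reported` and `completed`, the ISO timestamp format, and the sample tasks are hypothetical, since the actual column layout of the BMS export is not given here.

```python
from collections import defaultdict
from datetime import datetime


def average_hours_by_month(tasks, category="Area Too Hot"):
    """Average completion time in hours per reporting month for one work category.
    Field names are assumptions; adapt them to the real export."""
    durations = defaultdict(list)
    for task in tasks:
        if task["category"] != category:
            continue
        reported = datetime.fromisoformat(task["reported"])
        completed = datetime.fromisoformat(task["completed"])
        hours = (completed - reported).total_seconds() / 3600
        durations[reported.strftime("%Y-%m")].append(hours)
    return {month: sum(h) / len(h) for month, h in sorted(durations.items())}


# Hypothetical sample tasks for illustration only.
tasks = [
    {"category": "Area Too Hot", "reported": "2019-01-03T09:00:00", "completed": "2019-01-03T11:00:00"},
    {"category": "Area Too Hot", "reported": "2019-01-10T08:00:00", "completed": "2019-01-10T12:00:00"},
    {"category": "Area Too Hot", "reported": "2019-02-01T10:00:00", "completed": "2019-02-01T11:00:00"},
    {"category": "Area Too Cold", "reported": "2019-02-02T10:00:00", "completed": "2019-02-02T18:00:00"},
]
averages = average_hours_by_month(tasks)
```

Grouping by the month in which the task was reported is one possible choice; grouping by completion month would be equally defensible and only changes the key passed to `strftime`.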
associated dataset for Data Descriptor: Data Record 2, Raw Data.xlsx
File descriptions
-----------------

- train.csv - the training set
- test.csv - the test set
- data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
- sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

Data fields
-----------

Here's a brief version of what you'll find in the data description file.

- SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
- KitchenAbvGr: Kitchens above grade
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: $Value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale

Acknowledgments
---------------

Using data from: [House Prices: Advanced Regression Techniques][1]

2 attributes corrected from the description: KitchenAbvGr and BedroomAbvGr

[1]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
I crawled some of the posts from r/mexico. This dataset includes both the text of the websites submitted to the subreddit and the comments posted about them.
Context Gowalla is a location-based social networking website where users share their locations by checking in. Content Time and location information of check-ins made by users. Acknowledgements This data set is available from https://snap.stanford.edu/data/loc-gowalla.html E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.
This is a data packet that contains the input files for all analyses performed in this study as described in the publication. This includes input files for maximum parsimony, maximum likelihood and Bayesian inference analyses of mtSSU; maximum likelihood and Bayesian inference input files for a partitioned ITS dataset; and the distance table of ITS Jukes-Cantor sequence distances calculated in PAUP.
PRIMAP-crf is a processed version of data reported by countries to the United Nations Framework Convention on Climate Change (UNFCCC) in the Common Reporting Format (CRF). The processing has three key aspects: 1) Data from individual countries and years are combined into one file. 2) Data is re-organised to follow the IPCC 2006 hierarchical categorisation. 3) ‘Baskets’ of gases are calculated according to different global warming potential estimates from each of the three most recent IPCC reports. All Annex I Parties to the UNFCCC are required to report domestic emissions on an annual basis in the CRF. In 2015, the CRF data reporting was updated to follow the more recent 2006 guidelines from the IPCC and the structure of the reporting tables was modified accordingly. However, the hierarchical categorisation of data in the IPCC 2006 guidelines is not readily extracted from the reporting tables. We present the PRIMAP-crf data as a re-constructed hierarchical dataset according to the IPCC 2006 guidelines. Furthermore, the data is organised in a series of tables containing all available countries and years for each individual greenhouse gas and category reported. In addition to single gases, the Kyoto basket of greenhouse gases (CO2, N2O, CH4, HFCs, PFCs, SF6, and NF3) is provided according to multiple global warming potentials. The dataset was produced using the PRIMAP emissions module. Key processing steps include: extracting data from submitted CRF Excel spreadsheets, mapping CRF categories to IPCC 2006 categories, constructing missing categories from available data, and aggregating single gases to gas baskets. The processed data is available under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
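The last processing step, aggregating single gases into a basket under different global warming potentials (GWPs), can be sketched as follows. This is not the PRIMAP emissions module: the function name and the example emissions are hypothetical, only three gases are shown for brevity, and the choice of the IPCC Second, Fourth and Fifth Assessment Report 100-year GWPs is an assumption, since the description above does not name the three reports.

```python
# 100-year GWPs for three gases; which reports PRIMAP-crf uses is an assumption here.
GWP_100 = {
    "SAR": {"CO2": 1, "CH4": 21, "N2O": 310},
    "AR4": {"CO2": 1, "CH4": 25, "N2O": 298},
    "AR5": {"CO2": 1, "CH4": 28, "N2O": 265},
}


def basket_co2eq(emissions, report):
    """Weight each gas mass by its GWP and sum to a CO2-equivalent total.
    `emissions` maps gas name to mass, all in the same mass unit."""
    factors = GWP_100[report]
    return sum(mass * factors[gas] for gas, mass in emissions.items())


# Hypothetical emissions for one country and year (same mass unit for all gases).
example = {"CO2": 1000.0, "CH4": 10.0, "N2O": 1.0}
totals = {report: basket_co2eq(example, report) for report in GWP_100}
```

The same gas masses give 1520, 1548 and 1545 units of CO2 equivalent under the three GWP sets, which is why a basket value is only meaningful together with the GWP set used to compute it.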
Donald Trump Tweets
Context Being a fan of board games, I wanted to see if there was any correlation between a game's rating and any particular quality. The first step was to collect this data. Content The data was collected in March of 2017 from the website https://boardgamegeek.com/. This site has an API to retrieve game information (though sadly XML, not JSON). Acknowledgements Mainly I want to thank the people who run the board game geek website for maintaining such a great resource for those of us in the hobby. Inspiration I wish I had some better questions to ask of the data; perhaps somebody else can think of some good ways to get some insight from this dataset.
Information This dataset is a copy of the [Detecting Insults in Social Commentary Challenge][1] to run on a kernel. [1]: https://www.kaggle.com/c/detecting-insults-in-social-commentary
Context The data was originally collected from here: https://www.kaggle.com/c/painter-by-numbers but only train_2.zip was available on new Kernels. So the following datasets can be combined to hold all the needed data in 1 new Kernel: * [painter-test](https://www.kaggle.com/mfekadu/painter-test/) * [painters-train-part-1](https://www.kaggle.com/mfekadu/painters-train-part-1/) * [painters-train-part-2](https://www.kaggle.com/mfekadu/painters-train-part-2/) * [painters-train-part-3](https://www.kaggle.com/mfekadu/painters-train-part-3/) This is a messy way to get around Kaggle's 20 GB dataset limit, but you can [**_just fork this notebook_**](https://www.kaggle.com/mfekadu/painter-by-numbers-combined-dataset) to quickly get started. Content test.zip Acknowledgements Thank you [Kiri Nichol](https://www.kaggle.com/smallyellowduck) for collecting the data for the competition. Inspiration Do the pixels that represent a Picasso painting uniquely identify him?
Content More details about each file are in the individual file descriptions. Context This is a dataset from the [U.S. Census Bureau](http://www.census.gov/) hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found [here](https://fred.stlouisfed.org/) and they update their information according to the amount of data that is brought in. Explore the U.S. Census Bureau using Kaggle and all of the data sources available through the U.S. Census Bureau [organization page](https://www.kaggle.com/census)! * Update Frequency: This dataset is updated daily. Acknowledgements This dataset is maintained using FRED's [API](https://research.stlouisfed.org/docs/api/fred/) and Kaggle's [API](https://github.com/Kaggle/kaggle-api).
Context For movie viewers, movie posters are one of the first impressions used to get cues about a movie's content and genre. Humans can grasp cues like color, expressions on the faces of actors, etc. to quickly determine the genre (horror, comedy, animation, etc.). It has been shown that color characteristics of an image like hues, saturation, brightness, contour, etc. affect human emotions. A given situation arouses these emotions in humans. If humans are able to predict the genre of a movie by a single glance at its poster, then we can assume that the color characteristics, local texture-based features and structural cues of posters possess some characteristics which could be utilized in machine learning algorithms to predict its genre. Content The movie posters are obtained from the IMDB website. The collected dataset contains IMDB Id, IMDB Link, Title, IMDB Score, Genre and a link to download movie posters. Each movie poster belongs to at least one genre and can have at most 3 genre labels assigned to it. As the dataset also includes the IMDB score, it would be really interesting to see if the movie poster is related to the rating. Acknowledgements The IMDB Ids for movies were obtained from MovieLens. The IMDB Link, Title, IMDB Score, Genre and link to download movie posters were obtained from the IMDB website. Inspiration Does color play an important role in deciding the genre of the movie? Can raw image pixels contain enough information to predict the genre of a movie? Does the number of faces in the poster say anything about the movie genre? What is the most frequent color used in horror movies? Which features are important to predict the animated movie genre? If a movie belongs to more than one genre, can we predict them all? Can we use movie posters only to predict movie rating?
This dataset contains data presented in the figures of the paper "Semivolatile POA and parameterized total combustion SOA in CMAQv5.2: impacts on source strength and partitioning" published in Atmospheric Chemistry and Physics. It also links to the data archive of field observations.
Gene matrix from dataset
Context Observations of particles much smaller than us, and various understandings of those particles, have propelled mankind forward in ways once impossible to imagine. "The elements" are what we call the sequential patterns in which some of these particles manifest themselves. As a chemistry student and a coder, I wanted to do what came naturally to me and make my class a bit easier by coding/automating my way around some of the tedious work involved with calculations. Unfortunately, it seems that chemistry-related datasets have not yet been conveniently formatted into downloadable databases (as far as my research went). I decided that the elements would be a good place to start data collection, so I did that, and I'd like to see if this is useful to others as well. Other related data sets I'd like to coalesce are a large number of standard entropies and enthalpies of various compounds, and many of the data sets from the *CRC Handbook of Chemistry and Physics*. I also think as many diagrams as possible should be documented in a way that can be manipulated and read via code. Content Included here are three data sets. Each data set I have included is in three different formats (CSV, JSON, Excel), for a total of nine files. Table of the Elements: - This is the primary data set. - 118 elements in sequential order - 72 features Reactivity Series: - 33 rows (in order of reactivity - most reactive at the top) - 3 features (symbol, name, ion) Electromotive Potentials: - 284 rows (in order from most negative potential to most positive) - 3 features (oxidant, reductant, potential) Acknowledgements All of the data was scraped from 120 pages on Wikipedia using scripts. The links to those scripts are available in the dataset descriptions. Extra If you are interested in trying the chemistry calculations code I made for completing some of my repetitive class work, it's publicly available on [my GitHub][1] ([Chemistry Calculations Repository][1]). I plan to continue updating that as time goes on. [1]: https://github.com/jwaitze/Chemistry-Calculations
Context The Wikipedia dump is a giant XML file and contains loads of not-so-useful content. I needed some English text for some unsupervised learning, so I spent quite a bit of time extracting and cleaning up the text. Content Each line of the txt file is a 'sentence'. I put sentence in quotes because the content in these files hasn't been read all the way through for errors. Here is what I did: - Parsed out the opening text on non-disambiguation and non-table-of-contents pages. - Removed sentences requiring citations, because these were usually poorly formed. - Parsed each block of text into sentences using SpaCy. I then checked for bracket and quote correctness, filtering out sentences that didn't quite match up. - Removed sentences shorter than 3 letters or longer than 255 characters. This covers 97% of the data. - Removed duplicate sentences and, as a byproduct, sorted them alphabetically.
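The last three cleaning steps listed above (bracket and quote checks, the length filter, and de-duplication with sorting) might look roughly like this. It is a sketch, not the author's script: the function names are hypothetical, the balance check is a simple stack over `()[]{}` plus an even count of double quotes, and sentence splitting with SpaCy is assumed to have happened already.

```python
def is_balanced(sentence):
    """True when brackets nest correctly and double quotes come in pairs.
    This is a simplified stand-in for the author's correctness check."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in sentence:
        if ch in "([{":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack and sentence.count('"') % 2 == 0


def clean_sentences(sentences, min_len=3, max_len=255):
    """Keep well-formed sentences within the length bounds, de-duplicated and sorted."""
    unique = {s.strip() for s in sentences}
    kept = {s for s in unique if min_len <= len(s) <= max_len and is_balanced(s)}
    return sorted(kept)


# Hypothetical input sentences for illustration.
raw = [
    "Hi",
    "A cat (small) sat.",
    "A cat (small) sat.",
    "Broken (bracket.",
    'He said "hello.',
    "Zebra runs.",
]
cleaned = clean_sentences(raw)
```

Using a set for de-duplication and then sorting reproduces the "sorted alphabetically as a byproduct" behaviour described in the list.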
Context US Airline passenger satisfaction survey

Content

- Satisfaction: Airline satisfaction level (Satisfaction, neutral or dissatisfaction)
- Age: The actual age of the passengers
- Gender: Gender of the passengers (Female, Male)
- Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
- Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
- Customer Type: The customer type (Loyal customer, disloyal customer)
- Flight distance: The flight distance of this journey
- Inflight wifi service: Satisfaction level of the inflight wifi service (0: Not Applicable; 1-5)
- Ease of Online booking: Satisfaction level of online booking
- Inflight service: Satisfaction level of inflight service
- Online boarding: Satisfaction level of online boarding
- Inflight entertainment: Satisfaction level of inflight entertainment
- Food and drink: Satisfaction level of food and drink
- Seat comfort: Satisfaction level of seat comfort
- On-board service: Satisfaction level of on-board service
- Leg room service: Satisfaction level of leg room service
- Departure/Arrival time convenient: Satisfaction level of departure/arrival time convenience
- Baggage handling: Satisfaction level of baggage handling
- Gate location: Satisfaction level of gate location
- Cleanliness: Satisfaction level of cleanliness
- Check-in service: Satisfaction level of check-in service
- Departure Delay in Minutes: Minutes delayed at departure
- Arrival Delay in Minutes: Minutes delayed at arrival
- Flight cancelled: Whether the flight was cancelled or not (Yes, No)
- Flight time in minutes: Duration of the flight in minutes
**Tatoeba Sentences Corpus** This data is directly from the Tatoeba project: https://tatoeba.org/ It is a large collection of sentences in multiple languages. Many of the sentences come with translations into multiple languages. It is a valuable resource for Machine Translation and many Natural Language Processing projects.
There are 150 normal and 134 nodule thyroid CT images in the dataset.zip. The image format includes PNG and DICOM.
Taxa partition in nexus format for use in SVDq analysis implemented in PAUP 4a150. To be used in conjunction with RAD dataset in nexus format.
The dataset authors have created a terrestrial water budget data archive on a 1-degree grid. They identified 13332 global stations with complete records and created 30-year (1950-1979) climatological means for each month of the year for air temperature, precipitation, evaporation, soil moisture, and snow cover. These monthly climatological means were then interpolated to the 1-degree gridpoints. This data set contains both the gridpoint and the individual station data.
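A monthly climatological mean of the kind described above is the average, for each calendar month, of that month's values over all years in the reference period. The sketch below shows the calculation for one station and one variable; the `(year, month, value)` record layout, the function name, and the sample values are assumptions, not the archive's actual format.

```python
from collections import defaultdict


def monthly_climatology(records, start_year=1950, end_year=1979):
    """Mean value per calendar month over the reference period (inclusive).
    `records` is an iterable of (year, month, value) tuples for one station."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, month, value in records:
        if start_year <= year <= end_year:
            sums[month] += value
            counts[month] += 1
    return {month: sums[month] / counts[month] for month in sorted(sums)}


# Hypothetical air temperature records for one station.
records = [
    (1950, 1, 2.0),
    (1951, 1, 4.0),
    (1950, 7, 20.0),
    (1980, 1, 100.0),  # outside the 1950-1979 reference period, ignored
]
climatology = monthly_climatology(records)
```

Interpolating these station means onto the 1-degree grid is a separate step that is not sketched here.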
Content Official addresses assigned in the City of Los Angeles created and maintained by the Bureau of Engineering. Context This is a dataset hosted by the city of Los Angeles. The organization has an open data platform found [here](https://data.lacity.org) and they update their information according to the amount of data that is brought in. Explore Los Angeles's Data using Kaggle and all of the data sources available through the city of Los Angeles [organization page](https://www.kaggle.com/cityofLA)! * Update Frequency: This dataset is updated daily. Acknowledgements This dataset is maintained using Socrata's API and Kaggle's API. [Socrata](https://socrata.com/) has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public. [Cover photo](https://unsplash.com/photos/1mPBkYvbu3w) by [Timothy Eberly](https://unsplash.com/@timothyeberly) on [Unsplash](https://unsplash.com/) _Unsplash Images are distributed under a unique [Unsplash License](https://unsplash.com/license)._
References for all studies mentioned in the three datasets for: Johnston, A.S.A & Sibly, R.M. The influence of soil communities on the temperature sensitivity of soil respiration.
Consumer complaints are added to this public database after the company has responded to the complaint, confirming a commercial relationship with the consumer, or after they've had the complaint for 15 calendar days, whichever comes first. We don’t verify all the facts alleged in complaints, but we do give companies the opportunity to publicly respond to complaints by selecting responses from a pre-populated list. Company-level information should be considered in the context of company size and/or market share.
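The "whichever comes first" rule above can be written down as a small date calculation. This is a hedged reading, not the database's actual logic: the function and parameter names are hypothetical, and it assumes the 15 calendar days are counted from the date the company received the complaint.

```python
from datetime import date, timedelta


def earliest_publication_date(received_by_company, company_responded=None, max_wait_days=15):
    """Earliest date a complaint becomes eligible for publication:
    the company's response date or the end of the waiting period, whichever is first."""
    deadline = received_by_company + timedelta(days=max_wait_days)
    if company_responded is None:
        return deadline
    return min(company_responded, deadline)


# Hypothetical dates for illustration.
quick_response = earliest_publication_date(date(2020, 1, 1), date(2020, 1, 5))
no_response = earliest_publication_date(date(2020, 1, 1))
late_response = earliest_publication_date(date(2020, 1, 1), date(2020, 1, 30))
```

A response after the waiting period does not delay publication, which is the point of the 15-day limit.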
0 present/absent calls. The average frequency of values of empty probesets generated by the MAS 5.0 present/absent algorithm when τ = 0.015 (solid black line) and when τ = 0 (dotted line). The average was taken over the six samples. The percentage of central nucleotides in PM probes for empty probesets with values < 0.06, for all empty probesets (similar percentages are present in all probesets), and for empty probesets with values > 0.94 are shown. values generated with the Wilcoxon signed rank test for random empty probesets. The PM-MM probe-pairs from empty probesets with fewer than six alignment errors to any transcript in the GoldenSpike dataset were randomly re-assembled into probesets based on the central nucleotide (for example, only central T nucleotides in the PM probes). Symbols and lines are colored according to the central nucleotide. Copyright information: Taken from "Correcting for sequence biases in present/absent calls", http://genomebiology.com/2007/8/6/R125, Genome Biology 2007;8(6):R125-R125. Published online 26 Jun 2007. PMCID: PMC2394774.
MovieLens 1M dataset enriched with IMDB on movie attributes.
The Cars dataset has information about 3 brands/makes of cars, namely US, Japan, and Europe. The target of the data set is to find the brand of a car using parameters such as horsepower, cubic inches, make year, etc. A decision tree can be used to create a predictive model of the car brand.
Dataset
Context I love football and wanted to gather a dataset of football players along with their per-game performance from various sources. Content The csv file has the fantasy premier league data of all players who played in 3 seasons, and a detailed spreadsheet of each player is provided. Acknowledgements Thanks to TURD from Tableau for some of the data. Inspiration We have all wondered if it is possible to predict the future! Well, with the player data against each team and conditions, we get to check if future prediction is truly possible!
Context The **Convention on International Trade in Endangered Species of Wild Fauna and Flora**, or **CITES** for short, is an international treaty organization tasked with monitoring, reporting, and providing recommendations on the international species trade. CITES is a division of the IUCN, which is one of the principal international organizations focused on wildlife conservation at large. It is not a part of the UN (though its reports are read closely by the UN). CITES is one of the oldest conservation organizations in existence. Participation in CITES is voluntary, but almost every member nation in the UN (and, therefore, almost every country worldwide) participates. Countries participating in CITES are obligated to report on roughly 5000 animal species and 29000 plant species brought into or exported out of their countries, and to honor limitations placed on the international trade of these species. Protected species are organized into three appendixes. Appendix I species are those whose trade threatens them with extinction. Two particularly famous examples of Appendix I species are the black rhinoceros and the African elephant, whose extremely valuable tusks are an alluring target for poachers exporting ivory abroad. There are 1200 such species. Appendix II species are those not threatened with extinction, but whose trade is nevertheless detrimental. Most species in CITES, around 21000 of them, are in Appendix II. Finally, Appendix III animals are those submitted to CITES by member states as a control mechanism. There are about 170 such species, and their export or import requires permits from the submitting member state(s). This dataset records all <i>legal</i> species imports and exports carried out in 2016 (and a few records from 2017) and reported to CITES. Species not on the CITES lists are not included; nor is the significant, and highly illegal, ongoing black market trading activity.
Content This dataset contains records on every international import or export conducted with species from the CITES lists in 2016. It contains columns identifying the species, the import and export countries, and the amount and characteristics of the goods being traded (which range from live animals to skins and cadavers). For further details on individual rows and columns refer to the metadata on the `/data` tab. A much more detailed description of each of the fields is available in the [original CITES documentation](https://trade.cites.org/cites_trade_guidelines/en-CITES_Trade_Database_Guide.pdf). Acknowledgements This dataset was originally aggregated by CITES and made available online through [this downloader tool](https://trade.cites.org/en/cites_trade/). The CITES downloader goes back to 1975; however, it is only possible to download fully international data two years at a time (or so) due to limitations in the number of rows allowed by the data exporter. If you would like data going further back, check out the downloader. Be warned, though, this data takes a long time to generate! This data is prepared for CITES by UNEP, a division of the UN, and hence likely covered by the [UN Data License](http://data.un.org/Host.aspx?Content=UNdataUse). Inspiration * What is the geospatial distribution of the international plant/animal trade? * How much export/import activity is there for well-known species, like rhinos, elephants, etcetera? * What percent of the trade is live, as opposed to animal products (ivory, skins, cadavers, etcetera)?
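The third question above (what share of the trade is live) can be approached by counting records. The sketch below is illustrative only: the column name `Term`, the value `live`, and the inline sample rows are assumptions, so check the actual headers on the `/data` tab before using it. It counts records rather than summing quantities, because quantities in trade data can be reported in different units.

```python
import csv
import io


def live_trade_share(rows, term_column="Term", live_value="live"):
    """Percentage of trade records whose term marks a live specimen.
    Column name and value are assumptions about the file layout."""
    total = 0
    live = 0
    for row in rows:
        total += 1
        if row[term_column].strip().lower() == live_value:
            live += 1
    return 100.0 * live / total if total else 0.0


# Hypothetical in-memory sample standing in for the real CSV file.
sample = io.StringIO(
    "Taxon,Term\n"
    "Species A,live\n"
    "Species B,skins\n"
    "Species C,live\n"
    "Species D,ivory carvings\n"
)
share = live_trade_share(csv.DictReader(sample))
```

With the real file, `sample` would be replaced by an opened file handle, and the same function could be applied per country or per taxon by filtering the rows first.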
Context The [Sentiment Polarity Dataset Version 2.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/) was created by Bo Pang and Lillian Lee. This dataset is redistributed with NLTK with permission from the authors. This corpus is also used in the [**Document Classification** section (6.1.3) of Chapter 6 of the NLTK book](http://www.nltk.org/book/ch06.html). Content This dataset contains 1000 positive and 1000 negative processed reviews. Citation Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In ACL. Bibtex: @InProceedings{Pang+Lee:04a, author = {Bo Pang and Lillian Lee}, title = {A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts}, booktitle = "Proceedings of the ACL", year = 2004 }
Dataset connected with article: 'Increasing the use of conceptually-derived strategies in arithmetic: using inversion problems to promote the use of associativity shortcuts.' Abstract: Conceptual knowledge of key principles underlying arithmetic is an important precursor to understanding algebra and later success in mathematics. One such principle is associativity, which allows individuals to solve problems in different ways by decomposing and recombining subexpressions (e.g. ‘a + b – c’ = ‘b – c + a’). More than any other principle, children and adults alike have difficulty understanding it, and educators have called for this to change. We report three intervention studies that were conducted in university classrooms to investigate whether adults’ use of associativity could be improved. In all three studies, it was found that those who first solved inversion problems (e.g. ‘a + b – b’) were more likely than controls to then use associativity on ‘a + b – c’ problems. We suggest that ‘a + b – b’ inversion problems may either direct spatial attention to the location of ‘b – c’ on associativity problems, or implicitly communicate the validity and efficiency of a right-to-left strategy. These findings may be helpful for those designing brief activities that aim to aid the understanding of arithmetic principles and algebra.
Number of papers per category for ten key-entropy concepts. The concepts were selected according to their frequency of appearance across all abstracts in our dataset. Interactive plot: https://plot.ly/~larckov/1992.embed
Raw data for the 16/1 pollen surface sample dataset obtained from the Neotoma Paleoecological Database.
Introduction A car company has the data for all the cars that are present in the market. They are planning to introduce some new ones of their own, but first, they want to find out what would be the popularity of the new cars in the market based on each car's attributes. We will provide you a dataset of cars along with the attributes of each car along with its popularity. Your task is to train a model that can predict the popularity of new cars based on the given attributes. Dataset You are given a training dataset, train.csv. The file is a comma separated file with useful information for this task: train.csv contains the information about a car along with its popularity level. Each row provides information on each car. Information such as buying_price, maintenance_cost, number_of_doors, number_of_seats, etc. The definition of each attribute is as follows: buying_price: The buying_price denotes the buying price of the car, and it ranges from [1...4], where buying_price equal to 1 represents the lowest price while buying_price equal to 4 represents the highest price. maintenance_cost: The maintenance_cost denotes the maintenance cost of the car, and it ranges from [1...4], where maintenance_cost equal to 1 represents the lowest cost while maintenance_cost equal to 4 represents the highest cost. number_of_doors: The number_of_doors denotes the number of doors in the car, and it ranges from [2...5], where each value of number_of_doors represents the number of doors in the car. number_of_seats: The number_of_seats denotes the number of seats in the car, and it consists of [2, 4, 5], where each value of number_of_seats represents the number of seats in the car. luggage_boot_size: The luggage_boot_size denotes the luggage boot size, and it ranges from [1...3], where luggage_boot_size equal to 1 represents smallest luggage boot size while luggage_boot_size equal to 3 represents largest luggage boot size. 
safety_rating: The safety_rating denotes the safety rating of the car, and it ranges from [1...3], where safety_rating equal to 1 represents low safety while safety_rating equal to 3 represents high safety. popularity: The popularity denotes the popularity of the car, and it ranges from [1...4], where popularity equal to 1 represents an unacceptable car, popularity equal to 2 represents an acceptable car, popularity equal to 3 represents a good car, and popularity equal to 4 represents the best car. We also provide a test set of car along with the above attributes excluding popularity, in test.csv. The goal is to predict the popularity of the car based on its attributes.
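Before training a model, it can help to check that every row respects the value ranges listed above. The following is a minimal sketch using only the standard library. The allowed values come from the attribute definitions above; the column order in the sample and the helper name `invalid_fields` are assumptions made for illustration.

```python
import csv
import io

# Allowed values per attribute, as stated in the description above.
ALLOWED = {
    "buying_price": {1, 2, 3, 4},
    "maintenance_cost": {1, 2, 3, 4},
    "number_of_doors": {2, 3, 4, 5},
    "number_of_seats": {2, 4, 5},
    "luggage_boot_size": {1, 2, 3},
    "safety_rating": {1, 2, 3},
    "popularity": {1, 2, 3, 4},
}

def invalid_fields(row, require_label=True):
    """Return the names of attributes in `row` that are missing or out of range."""
    bad = []
    for name, allowed in ALLOWED.items():
        if name == "popularity" and not require_label:
            continue  # test.csv carries no popularity column
        try:
            value = int(row[name])
        except (KeyError, TypeError, ValueError):
            bad.append(name)
            continue
        if value not in allowed:
            bad.append(name)
    return bad

# Made-up two-row sample standing in for train.csv (hypothetical column order).
sample_csv = io.StringIO(
    "buying_price,maintenance_cost,number_of_doors,number_of_seats,"
    "luggage_boot_size,safety_rating,popularity\n"
    "3,2,4,5,2,3,2\n"
    "1,4,6,3,1,1,1\n"
)
for line_no, row in enumerate(csv.DictReader(sample_csv), start=2):
    print(line_no, invalid_fields(row))
```

For rows read from test.csv, pass `require_label=False` so the absent `popularity` column is not reported as an error.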
Drosophila Melanogaster ----------------------- Drosophila melanogaster, the common fruit fly, is a model organism which has been extensively used in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology. When it's not being used for scientific research, *D. melanogaster* is a common pest in homes, restaurants, and anywhere else that serves food. These flies are not to be confused with Tephritidae flies (also known as fruit flies). https://en.wikipedia.org/wiki/Drosophila_melanogaster About the Genome ---------------- This genome was first sequenced in 2000. It contains four pairs of chromosomes (2, 3, 4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA. ![D. melanogaster chromosomes][1] The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file. Bioinformatics -------------- Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations of each of the files in this dataset, as well as links to further reading on the biological basis for each.
If you haven't had the chance to study much biology before, some light reading (e.g. Wikipedia) on the following topics may be helpful to understand the nuances of the data provided here: [Genetics][3], [Genomics][4] ([Sequencing][5]/[Genome Assembly][6]), [Chromosomes][7], [DNA][8], [RNA][9] ([mRNA][10]/[miRNA][11]), [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23]. Of course, if you've got some idea of the basics already - don't be afraid to jump right in! Learning Bioinformatics ----------------------- There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24] - a platform that gives you bioinformatics coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image, which is a great library to help you with your analyses. Check out their [tutorial here][27]; we've also created [a Python notebook with some of the tutorial applied to this dataset][28] as a reference. Files in this Dataset --------------------- <hr> **Drosophila Melanogaster Genome** - genome.fa The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. <hr> **Meta Information** There are 3 additional files with meta information about the genome. - meta-cpg-island-ext-unmasked.csv This file contains descriptive information about CpG Islands in the genome. https://en.wikipedia.org/wiki/CpG_site - meta-cytoband.csv This file describes the positions of cytogenetic bands on each chromosome.
https://en.wikipedia.org/wiki/Cytogenetics - meta-simple-repeat.csv This file describes simple tandem repeats in the genome. https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat <hr> **Drosophila Melanogaster mRNA Sequences** Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA information is known as its transcriptome. mRNA files included in this dataset give insight into the activity of genes in the organism. https://en.wikipedia.org/wiki/Messenger_RNA - mrna-genbank.fa This file includes all mRNA sequences from GenBank associated with Drosophila melanogaster. http://www.ncbi.nlm.nih.gov/genbank/ - mrna-refseq.fa This file includes all mRNA sequences from RefSeq associated with Drosophila melanogaster. http://www.ncbi.nlm.nih.gov/refseq/ <hr> **Gene Predictions** A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This dataset includes a number of different gene prediction systems applied to the Drosophila melanogaster genome. https://en.wikipedia.org/wiki/Gene_prediction - genes-augustus.csv AUGUSTUS is a piece of software that predicts genes ab initio using Hidden Markov Models. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC441517/ - genes-genscan.csv GENSCAN is an older ab initio software for predicting genes. http://genes.mit.edu/GENSCANinfo.html - genes-ensembl.csv - ensembl-gtp.csv - ensembl-pep.csv - ensembl-source.csv - ensembl-to-gene-name.csv Ensembl provides gene annotation generated by their software Genebuild. This process combines automatic annotation alongside manual curation.
http://uswest.ensembl.org/info/genome/genebuild/genome_annotation.html We have also included some supplementary files for these, including predicted protein peptide sequences for each predicted gene. - genes-refseq.csv - genes-xeno-refseq.csv - refseq-link.csv - refseq-summary.csv We have included two RefSeq gene predictions in this dataset. The first is based solely on information from the Drosophila melanogaster genome. The second (genes-xeno-refseq.csv) uses genes from other organisms as a basis for predicting genes in Drosophila melanogaster. RefSeq RNAs were aligned against the D. melanogaster genome using blat; those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. We have also included supplementary files for these which include information about the genes that have been identified. http://www.ncbi.nlm.nih.gov/refseq/ <hr> What can you do with this data? ------------------------------- Genomic data is the foundation of bioinformatics, and there is an incredible array of things you can do with this data. A good place to start is to look at some of the meta supplementary files alongside the genomic sequence itself. We have a number of different gene prediction systems in the dataset; how do they compare to each other? How do they compare to the mRNA data? Working back from the refseq-summary.csv file, you can look at genes that code for particular proteins - can you find these genes in the genome? How much of the genome codes for the mRNAs found in our mRNA data? Of the mRNAs we have, how many map to the predicted genes and the predicted peptide sequence data? How much of the mRNA seems to be protein-coding vs how much looks like it is miRNA? Can you find pre-mRNA or splice variants within the mRNA data?
Does meta information like cytogenetic bands or CpG sites correspond with splice variants or a lack of mRNA altogether? Those are just some of many ideas that could get you started. Looking for Feedback -------------------- This is the first genomic dataset on Kaggle and we are looking for feedback from our community about how interesting this dataset is to them, or if there are ways we could improve it to better suit analysis. Please post suggestions for supplementary data, future genomes we could host, bioinformatics packages we should include on scripts, and any other feedback on the dataset forum. [1]: https://upload.wikimedia.org/wikipedia/commons/1/1d/Drosophila-chromosome-diagram.jpg [2]: http://flybase.org [3]: https://en.wikipedia.org/wiki/Genetics [4]: https://en.wikipedia.org/wiki/Genomics [5]: https://en.wikipedia.org/wiki/Sequencing [6]: https://en.wikipedia.org/wiki/Sequence_assembly [7]: https://en.wikipedia.org/wiki/Chromosome [8]: https://en.wikipedia.org/wiki/DNA [9]: https://en.wikipedia.org/wiki/RNA [10]: https://en.wikipedia.org/wiki/Messenger_RNA [11]: https://en.wikipedia.org/wiki/MicroRNA [12]: https://en.wikipedia.org/wiki/Gene [13]: https://en.wikipedia.org/wiki/Allele [14]: https://en.wikipedia.org/wiki/Exon [15]: https://en.wikipedia.org/wiki/Intron [16]: https://en.wikipedia.org/wiki/Transcription_(genetics) [17]: https://en.wikipedia.org/wiki/Translation_(biology) [18]: https://en.wikipedia.org/wiki/Peptide [19]: https://en.wikipedia.org/wiki/Protein [20]: https://en.wikipedia.org/wiki/Regulation_of_gene_expression [21]: https://en.wikipedia.org/wiki/Mutation [22]: https://en.wikipedia.org/wiki/Phylogenetics [23]: https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism [24]: http://rosalind.info/problems/list-view/ [25]: https://www.kaggle.com/mylesoneill/d/mylesoneill/drosophila-melanogaster-genome/rosalind-problem-solutions [26]: http://biopython.org [27]: http://biopython.org/DIST/docs/tutorial/Tutorial.html [28]: 
https://www.kaggle.com/mylesoneill/d/mylesoneill/drosophila-melanogaster-genome/getting-started-with-biopython [29]: https://en.wikipedia.org/wiki/FASTA_format
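
As noted in the description of genome.fa above, repeats are written in lower case and non-repeating sequence in upper case. The following is a minimal sketch of how that convention could be used to estimate the repeat share of each sequence. It uses plain Python rather than Biopython so that it runs anywhere; the function name and the demo record are made up for illustration.

```python
import io

def softmasked_fraction(handle):
    """Map each FASTA record name to its share of lower-case (repeat) bases."""
    totals = {}   # record name -> [lower-case count, total count]
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first whitespace-delimited token as the name.
            name = (line[1:].split() or [""])[0]
            totals[name] = [0, 0]
        elif name is not None:
            totals[name][0] += sum(1 for base in line if base.islower())
            totals[name][1] += len(line)
    return {n: (low / tot if tot else 0.0) for n, (low, tot) in totals.items()}

# Tiny made-up record; with the real file, pass open("genome.fa") instead.
demo = io.StringIO(">chrDemo\nACGTacgtAC\nGGttAA\n")
print(softmasked_fraction(demo))
```

Reading line by line keeps memory use low, which matters because whole chromosomes are stored as single records.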
Context Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition. Content **This comes directly from the README:** TRAINING DATASET FILE DESCRIPTION ================================================================================ The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format: CustomerID,Rating,Date - MovieIDs range from 1 to 17770 sequentially. - CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users. - Ratings are on a five star (integral) scale from 1 to 5. - Dates have the format YYYY-MM-DD. MOVIES FILE DESCRIPTION ================================================================================ Movie information in "movie_titles.txt" is in the following format: MovieID,YearOfRelease,Title - MovieIDs do not correspond to actual Netflix movie ids or IMDB movie ids. - YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release. - Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English. QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION ================================================================================ The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.
MovieID1: CustomerID11,Date11 CustomerID12,Date12 ... MovieID2: CustomerID21,Date21 CustomerID22,Date22 For the Netflix Prize, your program must predict all the ratings the customers gave the movies in the qualifying dataset based on the information in the training dataset. The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line. For example, if the qualifying dataset looked like: 111: 3245,2005-12-19 5666,2005-12-23 6789,2005-03-14 225: 1234,2005-05-26 3456,2005-11-07 then a prediction file should look something like: 111: 3.0 3.4 4.0 225: 1.0 2.0 which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of December, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of December, 2005, etc. You must make predictions for all customers for all movies in the qualifying dataset. THE PROBE DATASET FILE DESCRIPTION ================================================================================ To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id. MovieID1: CustomerID11 CustomerID12 ... MovieID2: CustomerID21 CustomerID22 Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset. If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faqprobe for that value. Acknowledgements The training data came in 17,000+ files.
In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt The contest was originally hosted at http://netflixprize.com/index.html The dataset was downloaded from [https://archive.org/download/nf_prize_dataset.tar][1] Inspiration This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos [here][2] [1]: https://archive.org/download/nf_prize_dataset.tar [2]: http://netflixprize.com/community/topic_1537.html
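
The training format and the RMSE check described in the README above can be sketched in a few lines. This is a minimal illustration, not the contest's scoring code. It assumes the combined text files keep the README's layout (a `MovieID:` header line followed by `CustomerID,Rating,Date` lines); the sample ratings and the `guess` values are made up.

```python
import io
import math

def parse_ratings(handle):
    """Yield (movie_id, customer_id, rating, date) from the README's training format."""
    movie_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            movie_id = int(line[:-1])       # "MovieID:" header line
        else:
            customer, rating, date = line.split(",")
            yield movie_id, int(customer), int(rating), date

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Made-up ratings in the training format; not taken from the dataset.
demo = io.StringIO("111:\n3245,3,2005-12-19\n5666,4,2005-12-23\n225:\n1234,1,2005-05-26\n")
rows = list(parse_ratings(demo))
truth = [r[2] for r in rows]
guess = [3.0, 3.4, 2.0]          # hypothetical model output
print(rows[0], round(rmse(guess, truth), 4))
```

Because `parse_ratings` is a generator, it can stream a large file without loading it all into memory.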
Context http://www.cell.com/cell/fulltext/S0092-8674(18)30154-5  Figure S6. Illustrative Examples of Chest X-Rays in Patients with Pneumonia, Related to Figure 6 The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia (middle) typically exhibits a focal lobar consolidation, in this case in the right upper lobe (white arrows), whereas viral pneumonia (right) manifests with a more diffuse ‘‘interstitial’’ pattern in both lungs. http://www.cell.com/cell/fulltext/S0092-8674(18)30154-5 Content The dataset is organized into 3 folders (train, test, val) and contains subfolders for each image category (Pneumonia/Normal). There are 5,863 X-Ray images (JPEG) and 2 categories (Pneumonia/Normal). Chest X-ray images (anterior-posterior) were selected from retrospective cohorts of pediatric patients of one to five years old from Guangzhou Women and Children’s Medical Center, Guangzhou. All chest X-ray imaging was performed as part of patients’ routine clinical care. For the analysis of chest x-ray images, all chest radiographs were initially screened for quality control by removing all low quality or unreadable scans. The diagnoses for the images were then graded by two expert physicians before being cleared for training the AI system. In order to account for any grading errors, the evaluation set was also checked by a third expert. Acknowledgements Data: https://data.mendeley.com/datasets/rscbjbr9sj/2 License: [CC BY 4.0][1] Citation: http://www.cell.com/cell/fulltext/S0092-8674(18)30154-5 ![enter image description here][2] Inspiration Automated methods to detect and classify human diseases from medical images. [1]: https://creativecommons.org/licenses/by/4.0/ [2]: https://i.imgur.com/8AUJkin.png
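
Given the folder layout described above (split folders containing one subfolder per category), a quick count per split and category is a useful first check for class imbalance. The following is a minimal sketch using only the standard library. The exact folder names and file extensions on disk are assumptions, and the demo builds a throwaway directory tree instead of reading the real images.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_images(root, suffixes=(".jpeg", ".jpg")):
    """Count image files per (split, category) under root/<split>/<category>/."""
    counts = Counter()
    for path in Path(root).glob("*/*/*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            split, category = path.parts[-3], path.parts[-2]
            counts[(split, category)] += 1
    return counts

# Demo on a temporary tree with empty placeholder files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for split, category, n in [("train", "Normal", 2), ("val", "Pneumonia", 1)]:
        folder = Path(tmp, split, category)
        folder.mkdir(parents=True)
        for i in range(n):
            (folder / f"img{i}.jpeg").touch()
    print(dict(count_images(tmp)))
```

On the real data, call `count_images` with the path to the extracted dataset folder.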
Arabic Handwritten Digits Dataset Abstract In recent years, handwritten digit recognition has been an important area due to its applications in several fields. This work focuses on the recognition stage of handwritten Arabic digit recognition, which faces several challenges, including the unlimited variation in human handwriting and the large public databases. The paper provides a deep learning technique that can be effectively applied to recognizing Arabic handwritten digits. LeNet-5, a Convolutional Neural Network (CNN), was trained and tested on the MADBase database (Arabic handwritten digit images), which contains 60000 training and 10000 testing images. A comparison is made amongst the results, and it is shown that the use of a CNN led to significant improvements over different machine-learning classification algorithms. Moreover, the CNN gives an average recognition accuracy of 99.15%. Context The motivation of this study is to use cross knowledge learned from multiple works to enhance the performance of Arabic handwritten digit recognition. In recent years, Arabic handwritten digit recognition has also had to deal with different handwriting styles, making it important to find and work on a new and advanced solution for handwriting recognition. A deep learning system needs a huge amount of data (images) to be able to make good decisions. Content MADBase is a modified Arabic handwritten digits database that contains 60,000 training images and 10,000 test images. MADBase was written by 700 writers. Each writer wrote each digit (from 0 to 9) ten times, which gives 700 × 10 × 10 = 70,000 images in total.
To ensure that different writing styles were included, the database was gathered from different institutions: Colleges of Engineering and Law, School of Medicine, the Open University (whose students span a wide range of ages), a high school, and a governmental institution. MADBase is available for free and can be downloaded from (http://datacenter.aucegypt.edu/shazeem/). Acknowledgements **CNN for Handwritten Arabic Digits Recognition Based on LeNet-5** http://link.springer.com/chapter/10.1007/978-3-319-48308-5_54 Ahmed El-Sawy, Hazem El-Bakry, **Mohamed Loey** Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016 Volume 533 of the series Advances in Intelligent Systems and Computing pp 566-575 Inspiration Creating the proposed database presents more challenges because it deals with many issues such as the style of writing, thickness, and the number and position of dots. Some characters have different shapes while written in the same position. For example, the teh character has different shapes in the isolated position. Arabic Handwritten Characters Dataset https://www.kaggle.com/mloey1/ahcd1 Benha University http://bu.edu.eg/staff/mloey https://mloey.github.io/
free online digital library that anyone can improve; Wikimedia project
US non-profit organization
online database project
freely editable world geographic database
free knowledge database project hosted by the Wikimedia Foundation and edited by volunteers
federal list of historic sites in the United States
place listed by the UNESCO as of special cultural or physical significance
online music metadata database
free database of the National Medical Library of the United States
volunteer effort to digitize and archive books
Online database for peptidases.
organization providing information on French cinema
online music database
online database of Broadway theatre productions and their personnel
inventory of the global conservation status of biological species
international authority file for personal names, subject headings and corporate bodies
Internet database of films, and movie professionals (actors, directors, screenwriters etc.)
online database of taxa
American website that collects review scores from both offline and online sources to give an average rating
premier British dictionary of the English language
International organization
international authority file
online database of burials
online dictionary of medical eponyms
annual list compiled and published by Fortune magazine
collaborative project intended to create an encyclopedia documenting all living species known to science
database arm of the US National Library of Medicine
authoritative taxonomic information on plants, animals, fungi, and microbes of North America and the world
Japanese digital library
American review aggregation website for film and television
database in the field of organic chemistry
treaty
An ontology for describing the function of genes and gene products.
regional database of daily updated census information
geographical database
classification of membrane proteins including ion channels
website that aggregates reviews of music albums, games, movies, TV shows, DVDs, and formerly, books
online multilingual dictionary
online database
digital library, online database and large-scale digitization project for biodiversity literature
online database with abstracts of medical articles, hosted by US National Library of Medicine
social webradio
collaborative compilation of information about the world's time zones
multilingual open-content collaborative map
controlled vocabulary for the purpose of indexing journal articles and books in the life sciences
service from Google
bibliographic database for economics
global partnership of conservation organisations that strives to conserve birds
Electronic index of zoological literature
English website about anime, manga and Japanese culture
website that tracks box office revenue
repository of scholarly manuscripts that are free to read
digital collection of European cultural heritage
100-million-word text corpus of samples of written and spoken English from a wide range of sources
web service providing access to resources of national libraries across Europe
catalog of human genes and genetic disorders and traits, with a particular focus on the gene-phenotype relationship
French digital library
German bibliography project (17th century prints)
company
printed and online English dictionary
International Architecture Database
New Testament books
art market company
French video game website
video games news and reviews website
database of compact disc track listings
many
data series in political science research
controlled vocabulary covering all areas of interest of the Food and Agriculture Organization of the United Nations (such as food, nutrition, agriculture, fisheries, forestry, environment etc.) published by FAO and edited by a community of experts
the Swiss Myocardial Infarction Registry
bibliographic database for marine science topics
online database of DOS video games
online clinical medical knowledge base
computer database of medieval Latin abbreviations
community-powered dictionary of slang terms
digital library
database for excellent scientists in Germany
Archive of Amiga-related software and files
bibliographic database
stock photo licensing company.
non-profit repository of high-quality, high-value media of endangered species
library catalog
hierarchical database that stores configuration settings and options on Microsoft Windows operating systems
United States independent agency
international digital library operated by UNESCO and the United States Library of Congress
national library in Japan
database of information about movie stars, movies and television shows
Korean film database
astronomical database
online project collecting example sentences
website and database about audio recordings
collaborative site for sharing musical scores
project for the creation of a virtual library of public domain music scores
collection, text corpus of ancient Egyptian funerary spells written on coffins
online resource
online information service produced by the United States National Library of Medicine
database
collection of protected architectural creations in the United Kingdom
ontology
online dictionary
online genealogy platform with web, mobile, and software products and services
web-based database of marine species
controlled vocabulary used for describing items of art, architecture, and material culture
digital library
theoretical and practical tool for information integration in the field of cultural heritage
archaeological database
vocabulary terms that can be used to describe web resources
database of scientific plant names
digital library
European fingerprint database for identifying asylum seekers and irregular border-crossers.
genealogy website
astronomical database
picture library in Dresden, Germany
index of chemicals
biomolecular database
internet portal for art history research and teaching
Shut down digital library
social cataloging web application
Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory
commercial scientific social network
database of toxicity information
national Dutch documentation center of art history
international open access database of protein and nucleic acid structures
national heritage register of Australia
library
Mexico's principal government institution in charge of statistics and census data
online resource for fossil animals, plants, and microorganisms
central database maintained by the French Ministry of Culture
web-based database for the academic genealogy of mathematicians
geographical database available and accessible through various web services, operated by Unxos GmbH
collaborative database of audio snippets, samples, recordings, bleeps
global species database of fish species
dictionary of biographies of Canadian people published in both English and French
free pornographic video sharing website
international union library catalog
library
Russian Classification on Objects of Administrative Division
organization
German image database of art and architecture
online database of open access digital repositories throughout the world
website providing baseball statistics
Free online index to biographical reference works in the German language area
electronic library of the University of Szeged
online database of board games, game designers and game publishers worldwide
DNA sequence database
database and ontology of molecular entities focused on small chemical compounds
Hungarian digital library
database of protein sequence and functional information
streaming media system
biological database
database of plant names
UNESCO publication of endangered languages
digital cultural archive initiative that publishes free electronic versions of books significant to the culture and history of the Nordic countries
pornographic video sharing website
website on metal bands
Repository and publisher for data from earth system research (georeferenced)
web service that provides a searchable database of translations for a number of language pairs
library
data set of American English in 1961
global database of shark attacks
service on internet
United States government designation for food additives
Israel's principal government institution in charge of statistics and census data
project about 16th-century authors and publishers, run by Italy's Istituto Centrale per il Catalogo Unico
online architecture database
charity assessment organization that evaluates charitable organizations in the United States
former academic search engine
audiobook library
genealogy website
database
bioinformatics and cheminformatics database from the University of Alberta
organization for baseball history
website and weblabel
database of language structures
Internet database of the Swedish Film Institute
German online database about actors, films, TV series, video and advertising productions
Esperanto dictionary
periodical literature
digital library
social music cataloging, rating and reviewing website
knowledge base and artificial intelligence project
website
digital library of works mostly in Hungarian
dictionary
database of geographical objects
music streaming and recommendation website
Wiki-based lyric database
online project for book data of the Internet Archive
online subscription index of citations
an electronic archive of German-language text corpora of written language with over 42 billion words
German text corpus
computer science bibliography website hosted in Germany
public Internet library catalog in Germany
curated list of peer-reviewed Open-Access journals
Internet Database of Diplomatic Documents of Switzerland
Virtual library
ancient catalogue of the Library of Alexandria
Daily fixing of renminbi rates
pornographic video sharing site
Swiss digital library for antique works
bibliographic database
German National Library of Science and Technology (TIB) document delivery service
on-line database about opera
governmental database used by European countries
description of triangle related points
bibliographic database
gene sequence database
website
library catalog
Processing Network
database of famous people
astronomical database
Semantic Web ontology to describe relations between people
central service unit of the MPG
ontology for the domain of human anatomy
German periodical about films
subscription digital library
catalog of incunabula
Register of organizations managed by some countries for statistical purposes
book, website, database
online knowledge base (2007-2016)
citizen science project
virtual art gallery for European fine art created before 1900
digital library
Digital library of Bibliothèque Nationale de France
film and photography museum and archive in Rochester, New York, United States
organization
association promoting French cinema abroad
thesaurus of geographic names by the Getty Research Institute
inter-institutional terminology database of the European Union
German bibliography project (16th century prints)
chemical database
aggregator of scientific data on biodiversity; data portal
zoological reference book
medical bibliographical database
volunteer-run library of free content sheet music
catalogue of grapevine genetic resources
biographical reference work in Norwegian
geospatial extension for the PostgreSQL Database
Virtual specialist library for German language studies
online database collecting taxonomic information on all living reptile species
free resource offering access to experimental data characterizing antibody and T cell epitopes involved in infectious disease, allergy, autoimmunity, and transplant.
International Bibliography for Theology and Religious Studies
Astronomy objects catalog
digital library
largest assembly of data on the world's terrestrial and marine protected areas
German bibliography project
database management system developed by Stanford University
database providing an authoritative source of bibliographic information
dictionary of the Portuguese language
Swedish national union catalogue
Online database
multi-volume series covering all bird species
Free access Quebec digital library
non-profit organisation in the USA
international project to index all formal (scientific) names in the kingdom of Fungi
Internet Archive
free, online database and bioinformatics resource
German-speaking social-media-driven movie community
US Department of Education data collector and publisher
library catalog on the web
image database in the Netherlands
online dictionary
German computerized civil registry
Polish language film database
Dutch photograph press agency
record book of the Stationers' Company of London
database of biological pathways
online bibliographic database later known as Web of Science
database of chemical structures
online database of biochemical reactions
heritage register of Victoria, Australia
Dutch database of species
W3C recommendation designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or other structured controlled vocabularies
database of American death records
Internet project providing information about the diversity and phylogeny of life
database of chemicals owned by the Royal Society of Chemistry; see P661
oldest electronic library in the Russian Internet segment
catalog of all Swiss libraries
controlled, relational vocabularies of terms for the domain of systems biology
Russian website about cinematography
EC OJ bulletin of public procurement tenders
Berlin animal voice archive
online database of energy-efficient appliances
schema for describing posts and interactions on forums, message boards, blogs etc.
website
website about Dutch language and Dutch literature
Database on private international law
thesaurus of artists and people by the Getty Research Institute
website documenting the known species of ants
Online database of the Max Planck Institute for the History of Science
Virtual specialist library
website
software tools for digital library collections
online database for fungi
database on tropical plants, mainly the ecozone Neotropica
system used by the libraries of French universities and higher education establishments to identify, track and manage the documents in their possession
The manually curated portion of the UniProt database of protein sequence information.
movie recommendation website
database
Mixed martial arts website
database of ZX Spectrum video games
biological database; expands official version of the Enzyme Nomenclature system
website
database about animals in Europe
social networking website for academics
defunct American news photo agency
database of scientific names for algae
library resource portal in France
amphibian database
online database of animal natural history, distribution, classification, and conservation biology
art historical database in Belgium
French national database under the Ministry of Ecology
Database for potato varieties
interactive website accessing anatomical data
US Department of Education online repository
organization
US American digital public library
project of the Unicode Consortium to provide locale data in XML format for use in computer applications
database of citations about engineering
online bibliographic database created by Universidad de La Rioja, Spain
General dictionary of Basque and its corpus
digital library platform
online database containing standardized peer-reviewed articles that describe specific heritable diseases
US SEC computer system
public site promoting maritime safety and quality
genealogical organization and website
Automated National File of Genetic Prints
set of works about the flora of Australia
French-language terminology database
bibliographic database
database run by a multi-organisation initiative
database developed by the IUCN listing information about taxa that are deemed invasive in various countries and regions of the world
digital library
French open access repository
indexing database
non-profit organisation in the USA
digital archive created by the New York Public Library
open-access digital library from United States
open wiki to catalog food, nutrition facts and ingredients
scientific database
3D encyclopedia of proteins and other molecules
media annotation website
online interpretation specialist company
digital library supported by National and University library of Slovenia
federal list of historic sites of Canada
Database for the model organism Saccharomyces cerevisiae
project to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond
an online Scots-English dictionary
bibliographic database of nuclear science and technology
French national database of all companies
French botanical community
U.S. government database
The Arabidopsis Information Resource (TAIR) collects information and maintains a database of genetic and molecular biology data for Arabidopsis thaliana, a widely used model plant.
single window portal to integrate the digital repositories of India, sponsored by NME-ICT, MHRD, Govt. of India
research center
Czech and Slovak web project providing a movie database
taxonomic database
biological database
database containing information about non-coding RNA (ncRNA) families and other structured RNA elements
Italian-language anime, manga, and Japanese drama database website
chemical database
online public access catalog of the Library of Congress
pornographic website
internet-based database of comic book information
digital library
digital library
online library
Belgian foundation
astronomy database
star catalog
English-language anime and manga database website
web archive of Portugal
Russian website
ontology for descriptive linguistics
Russian and English online dictionary
corpus of the Russian language that has been partially accessible through a query interface online since 2004
the process of determining correspondences between concepts
speech audio files and text transcriptions
website
online dictionary
Compact disc database
Russian online library
Current Research Information System in Norway (CRIStin)
supplier of library and information data for all the Norwegian university and college libraries
Norwegian movie portal
database serves as the catalog and index for the collections of the United States National Agricultural Library
bibliographic database for topics related to religion
system that measures occasional harms from medications to ascertain whether the risk-benefit ratio is high enough
online genomic database
database of allele frequencies in the human genome
annual designation by the U.S. National Trust for Historic Preservation of 11 sites
amateur ornithological association, founded 1968
text corpus of American English
genomic database
digital library
digital library of Modern Greek studies
animal transcription factor database.
catalogue of published genome size estimates for various animal species
digital repository of marine science information and images
organization
research database
online dictionary produced by Oxford University Press
online database containing historical information on the performing arts in Australia
online database about Australian literature
computational comparative linguistics program
A system designed to collect, analyze, and respond to voluntarily submitted aviation safety incident reports
bibliographic database for life sciences
Binary subcomplexes in proteins database
multilingual semantic network and encyclopedic dictionary
E books.
identifier issued by the European and Mediterranean Plant Protection Organization (EPPO), to uniquely identify plants, pests and pathogens that are important to agriculture
online reference resource
Open-access digital library of Spanish-language texts
database of telephone calls maintained by the United States National Security Agency
database for protein and small molecule interactions
biological database
biological database
database of biological reactions
bibliographic database
provides basic biographical information on all past and present United States federal court Article III judges
Australian ornithological conservation organization
bitter compounds
digital library
bibliographic database covering humanities literature published in English
database of protein fragments
database of insects and arthropods
bibliographic database platform produced by CABI
database on evolutionary relationships of protein domains
database
Database of somatic cancer mutations
library
archival institution
organization
catalog of cultural heritage in the State of California, United States
digital library
government biodiversity commission in Mexico
biographical reference work
online catalogue for books
digital library of resources related to the history of agriculture in the United States
corpus of American English of more than 560 million words
online database of contemporary and historical documents relating to Irish history and culture
one of the official Digital Object Identifier Registration Agencies of the International DOI Foundation
international digital library project aimed at putting text and images of recovered cuneiform tablets online
project to digitize phonograph cylinders
UN livestock genetics programme
RDF/OWL schema for describing software projects
database for virologists
Database of Interacting Proteins
computer database file
American medical research facility
database produced by the National Institutes of Health (NIH), National Library of Medicine (NLM), and the NIH Office of Dietary Supplements (ODS)
aggregator for New Zealand digitised content
digital library of comic books
preservation project
Georgia's state-wide cultural heritage digitization initiative
Database of Protein Disorder
formal ontology of human disease
digital library
gene and protein interaction database for Drosophila melanogaster
digital library
online database of bird observations
database
digital theatre archive based at the University of East London, in London, England, put offline in 2018
a scientific database for the bacterium Escherichia coli K-12 MG1655
database of biological information
database of published works
Turkish social networking service
one of the largest available parallel corpora involving the Arabic language
scientific project
set of documents of the proceedings of the European Parliament from 1996 to the present
database of biomedical research
Online database from the EBI on Nucleotides
exoplanet and star catalogue funded by NASA
image-archiving system
formal ontology
Brazilian film and TV social network
online database of Western Australian flora
multi-volume book and online database
filamentous fungi
database of administrative boundaries
database
High-resolution shoreline data set
online bioinformatics database
dictionary of location and spelling of geographical names in Australia
human gene databases
wiki-based collection of information related to human genes
collection of interconnected applications and databases that biologists use as repositories and as tools
biological database hosted by NCBI
bibliographic database of scientific literature in the geosciences
American psychological bibliographic database
Georgia's virtual library and an initiative of the Board of Regents of the University System of Georgia
database of whole genome sequencing data of microorganisms
Comprehensive annotation resource for human genes and transcripts
Database at Stanford University that tracks 93 common mutations of HIV
database query language
database about hazardous chemical substances
digital, user-generated archive of historical photos, videos, audio recordings and personal recollections
Digital library
database of human metabolites
online database of proteins
research database focused on computer science, electrical engineering, electronics, and allied fields
integrated genomic resources of human cell lines for identification
open database for high energy physics research
online database on the history of science
information about elements of a schema in a database management system
linguistics database
A no longer updated database covering information about the proteomes of humans, mice and other animals
US national index of serious criminal histories
online, open access reference work covering recognition, biology, distribution, impact, and management of invasive plants and animals
virtual library providing access to academic literature for Iraqi universities and related institutions
Functionally related proteins across PPI networks
Guatemala's civil registry
chemical database of bioactive molecules with drug-like properties
Spanish electronic magazine
biological database
heritage institution
Database
British database of the Ministry of Labor and Pensions
million-word collection of British English texts compiled in the 1970s
Regional Cooperative Online Information System for Scholarly Journals from Latin America, the Caribbean, Spain and Portugal
information system about researchers and institutions in Brazil
online database that provides accurate names and info for prokaryotes according to ICNB
on-line bibliographic database in medicine and health sciences
astronomical catalogue
standardised patent data corpus available for research purposes
Online database
Microsatellites database
database
digital library of primary sources about 19th-century America
Database for putative Transcription Factor Binding Sites
online database
online library catalog of the University of California
website displaying song lyrics
biological database of microRNA sequences and annotations
database on microRNAs and their targets
mimotope database
Database of comparative protein structure models, calculated by the modeling pipeline ModPipe
online biological database
free online biological database
series of obituaries/biographies of fellows of the Royal College of Physicians
collection, database of virtual musical scores representing the logical content of the standard classical repertory from 1690 to 1890
music database
online music lyrics database
Nucleic acid phylogenetic profiling
epigenomics database
methylation data derived from next-generation sequencing data
biological database
database for uniquely identifying all the points of access to public transport in the UK
entity in UK
database on all bridges and tunnels in the United States
text corpus includes classics of Polish literature, daily and specialist press, conversation recordings, ephemera and Internet texts
U.S. drivers database
topographic data
organization
public domain geographic data collection
human protein knowledge bioinformatics resource
Digital library website of Catholic works
digital library of New Zealand and Pacific Island texts and materials
chess test suite
Nynorsk dictionary
freely accessible online library
digital library and database
database on corporate entities under share-alike, open licence
online language and dictionary service that gives you access to digital dictionaries from your computer, tablet or mobile phone.
DNA Replication Origin Database
catalogue of orthologous protein-coding genes across vertebrates, arthropods, fungi, plants, and bacteria
database of orthologous genes across multiple species
text corpus of 21st-century English
comprehensive database on proteins
information system designed to support the biomedical research community’s work on bacterial infectious diseases
Presaging Critical Residues in Protein interfaces - DataBase
pictorial database of 3D structures in the Protein Data Bank
Pathogen Host Interaction database
organization digitizing the cultural heritage of the Punjab region of India and Pakistan
database
personal Disposable Income
dictionary of the Sumerian language
online directory of philosophy works
phosphorylation site database
Database of 3D structures of phosphorylation sites derived from Phospho.ELM
public database for catalogs of gene phylogenies
distributed data system that NASA uses to archive data collected by Solar System missions
Plant Proteome Database
database resource that links plant traits to genomics data.
organization promoting the development of persistent and openly accessible digital taxonomic literature
database of digitized books from the Early Modern period
Database of experimentally verified glycosites and glycoproteins of the prokaryotes
Database of protein repeats
non-profit, non-partisan research organization in the US
database of circular dichroism and synchrotron radiation
biological database
Database of pseudogenes annotations compiled from various sources
Canadian national digital repository
digital library
quilt documentation project hub and resource center for materials relevant to quilt study and primary source research
Database of resources for systems biology of DNA damage and repair
database of RNA-binding proteins
registry of open access policies
database for DNA restriction enzymes
bibliographic database and digital library of open access journals; funded by Universidad Autónoma del Estado de México
Maintains photographic and digital data as well as mission documentation and cartographic data. Each facility's general holding contains images and maps of planets and their satellites taken by solar system exploration spacecraft. Open to the public.
heritage register that listed natural and cultural heritage places in Australia that was closed in 2007
normalized dictionary for drugs and drug formulations from the National Library of Medicine, part of UMLS
database on aging
bibliographic database of open access journals
website about the history of British film, television and social history
global online database of information about marine life
Online database
ontology of sequence features used in biological sequence annotation
website
American music library of the Eastman School of Music, Rochester, NY
open access eLibrary; part of Elsevier
computational biology database
database of artworks of musée du Louvre
biological database
database of artificially engineered genes
Database in the United Kingdom
Norwegian dictionary
valuation for ecclesiastical taxation of English, Welsh, and Irish parish churches and prebends
national collaborative project
massive collection of digital media from before 2000
online database of metabolic pathways
knowledgebase of protein termini
online database of compounds toxic to human
Long-term biobank study of 500,000 people
ontology
a database of scientific publications
vaccine database
online database of viral genomes and bioinformatics tools
list of historic properties in the Commonwealth of Virginia, United States
Machine-readable description of an RDF data set
website cataloging sampling in music
database of biological pathways
on-line publisher of sheet music 2006-2013
Academic project focused on pre-20th-century English language women writers, their writings, and the reception of their work
project in Kew Gardens
covered bridge numbering system
citizen science ornithology project
knowledge base developed at the Max Planck Institute for Computer Science in Saarbrücken
database of chemical compounds
open access website, official ICZN taxonomic registry
subscription-based software as a service (SaaS) company based in Vancouver, Washington
online reference (authoritative)
database related to theater performances in Poland
multilingual online dictionary
public service system to provide resources sharing among academic libraries in China
Brazilian film site
linguistic corpus of Latin texts from ancient Galicia
database of Brazilian comics and artists
Internet database of movie scripts
Czech database for medicines
association football (soccer) website
Gene Disease Database
website showcasing sign languages worldwide
database indexing the audiovisual collections at the National Library of Sweden
Scholarly and Academic Information Navigator, Citation Information by NII (National Institute of Informatics)
online database of companies and start-ups
Event log on Windows NT systems.
Chinese website
electronic journal platform run by the Japan Science and Technology Agency
Database system by TDC Tedecy Software Engineering
Japanese online cataloging system
online dictionary
library
Japanese government biometric database
Polish digital library
Repository of data about research herbaria, most importantly standardized codes.
government database administered by the Polish Police
heritage institution
the National Digital Archives are one of three central archives of the state archive network in Poland
union database showing information on the holdings of Polish research and academic libraries
digital library of Christian Greek and Latin texts
digital newspaper collection, part of Digital materials of the National Library of Finland
Finnish register of social welfare and healthcare professionals
archives, part of Tampere University
Database of Norway protected areas
A registry for patients in Norway
free online cancer encyclopedia
system for recording and storing data about inhabitants
Czech digital library
Database, used in mobile radio systems
collection of 7667 Pathway/Genome Databases (PGDBs)
Terminological database for the Basque language
database of academic production in the Basque language
library
Genbank of different varieties of fruit trees and shrubs
Danish film portal
online statistics database from Statistics Denmark
online taxonomic encyclopaedia
general-purpose multilingual Esperanto dictionary for the Internet
digital library of Hebrew religious Books & periodicals
Index of Articles on Jewish Studies
South Korean digital library and repository
digital table of contents of Hungarian scientific and technical journals
monumental fifty-volume series of primary sources for the study of Byzantine history
online English-Irish dictionary
stratigraphic database of the Netherlands
browsable database to view authority headings for subject, name, title and name/title combinations
Czech digital library
set of written texts in electronic form in the Czech language
biographical database of Chinese people
lithostratigraphic database of Germany
Online athletics database
database of French museums
curated classification and nomenclature for all of the organisms in the public sequence databases
Dutch photo agency
free and open-source multi-model NoSQL database developed by Apple
digital map database operated by the United States Geological Survey
database for Canadian geographic features
database of RNAs
non-commercial online mineralogical database
mineral database
mineral database
mineral database
database of members of the French Lower houses since 1789 maintained by the French National Assembly
Archive of research findings on subjective appreciation of life.
wiki which hosts texts and images that are in the public domain according to New Zealand copyright law. Similar to Wikisource and Wikimedia Commons
website with information about viruses
movie database for German movies
journal
a corpus management system in the program area Oral Corpora of the Institute for German Language
German database on hazardous substances
heritage register for objects of cultural heritage in Poland
film database
online repository of royalty-free music
archive of texts and translations of art songs and choral works
ontology
integrated web resource focused on maintaining a comprehensive database of broadly neutralizing HIV-1 antibodies
Johns Hopkins University
civil registry of the Netherlands
Basque digital library
heritage institution
pornographic video hosting service
Documents and references the official public legal status of Belgian enterprises or organisations from the 1950s
internet database of visual novels
Russian scientific digital library, the largest legal research and educational resource of the Russian segment of the Internet
text corpus of Russian online texts created in 2001-2012; allows searching by combinations of character patterns, morphological and syntactic features
annual list of the top 100 companies in Ghana
register of buildings protected against demolition, extension, or alteration without special permission
data format for metadata about datasets
Australian government biometric database
corpus of medieval English literature
sequence database of DNA barcoding
non-profit academic service provider; the only complete listing of all medieval and early modern manuscripts of European polyphonic music
initiative to coordinate the development of the community standards and formats for computational models in systems biology
Open-source software
citizen science project and website
collection of over 30,000 titles of American popular music spanning from the late 18th century to the early 20th century
digitised newspaper collection
online database on architects
collection of 29,000 pieces of American popular music spanning the years of 1780 to 1980
publications database of the U.S. National Technical Information Service
US online database of music video information
statutory list of heritage places in Queensland, Australia
Australian government register
research output digital online sharing platform
database that collects the expression patterns of Drosophila melanogaster in embryogenesis
former database of schools of public health and educational institutions
organization
not-for-profit open access digital library featuring research resources on public interest issues from Australia, New Zealand and international sources
An ontology for human phenotypes in hereditary and non-hereditary diseases.
data model for bibliographic description
US-based media company
integrative and comprehensive publicly available database and analysis resource to search, analyze, visualize, save and share data for influenza virus research
database of handwritten digits
website which tracks film box office revenue
system for validating protein structures
NMR spectroscopy database of carefully corrected or re-referenced chemical shifts
public archive of tree ring data
Database of transcription factor binding sites
open-access biological database
pharmacology database
interactive and machine access to commonly used ontologies, controlled vocabularies, and other lists for bibliographic description
full-text database of texts from Czech media
database
inter-governmental demographic data sharing system in the United States
predictor film
digital library
bibliographic database of the world's lesser-known languages, maintained at the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany.
organization
collection of information about Australian Aboriginal and Torres Strait Islander languages
Spanish database of journal articles
website; online music database of hymns and hymn authors
Post-1970 terrorist incident database by the University of Maryland, College Park
open database
Databases of ribosomal RNAs
online biological database
protein structure validation web server
archive of biodiversity-related scientific papers
library
online library database aggregator; hosted by the National Library of Australia
official wiki gathering mapping guidelines and explaining recommended tag usage for the OpenStreetMap project
tools that make open science practice easier
national register that contains basic information about Finnish citizens and foreign citizens residing permanently in Finland
multi-institutional repository, maintained by National Library of Finland
online artist database
online service that provides access to materials from Finnish museums, libraries and archives
database of scientific information in Poland
database of wild California plants
Dutch open data website
Online database of pipe organs in the United Kingdom
online publication of aircraft accident details
sports statistics website
Database about Ice hockey
community contributed taxonomic checklist of all vascular plants of Canada, Saint Pierre and Miquelon, and Greenland
database
Germplasm Resources Information Network
Database about Car Racing
Turkish Cinema Archive Database
digital library
Netherlands
database of grasses
Digital library of works of J. S. Bach and his family
online genealogy database
shared library cataloguing network in Australia
authority file for persons, organisations, works, topics, and geographic places, of the French National Library
A guide produced by NIOSH about hazardous chemicals
database on the relationships between human variations and phenotypes
online database of video games
Chinese-language anime, manga, and games database website
website about conifers
cooperative repository of open access Catalan journals
Swedish taxonomy database
open access repository run by CERN
Gene database
dataset of shipyards
Medieval imagery
dataset of ships
Chemical toxicity database by the U.S. Environmental Protection Agency
database of the birds of the world
database of mathematical software
free, collaborative database about films
species-metabolite relationship database
database with lipid structures
A database by the Department of Veterans Affairs maintaining FDA approved drug concepts and their interactions
database
not-for-profit digital archive
website on European royal families
web-based database for the academic genealogy
Dutch database
international research consortium
database of American birth, death records
website on Victorian MPs since 1851
directory
digital library with topics related to Qatar
online database of the Austrian Parliament
database managed by the INHA from France
search engine for scientific articles and books
Canadiana website
portal of papyrological and epigraphical resources
Argentina's civil registry
A European Bioinformatics Institute web resource to search and visualize biomedical ontologies
website on basketball
online database of gene-disease relationships
a research data repository
online database about female authors
web-based database for the academic genealogy
Archive center of the University of Pittsburgh
a project indexing and formatting FDA data, and making it accessible to the public
online project designed to be a "smart" search service for journal articles
website about Greek mythology
knowledge database led by James Burke
index of bibliographic information on academic journals in the humanities and social sciences
the automatically annotated portion of UniProtKB which is unreviewed
German database and website
digital library
Croatian database of scientific papers
Cultural institution
chemical information database from University of California, Irvine
release of 11.5 million documents created by the Panamanian corporate service provider Mossack Fonseca
catalogue contains the holdings information of the German National Library starting from 1913
Public knowledge base providing Research Resource IDentifiers (RRIDs)
neuroimaging database
catalogue of scientific names of New Zealand biota
Database for nutrients
database of hepatics
database collecting three-dimensional structures of natural metabolites
database
database of researcher impact by Frontiers
collection of pollen and spores information in the Australasian region
database of recognised astronomical names on planets or their satellites
database
database of mass spectra
Database of CYP reactions
database for chemical compounds in patents, operated by the European Bioinformatics Institute (EBI)
Database of chemical entities
Database of NMR spectra
Repository for metabolomics data.
Database of bio-active chemical entities of nano size
Nucleotide Database of the National Center for Biotechnology Information
Wiki-database of plant lncRNAs
Pathosystems Resource Integration Center
database of CYP metabolism
freely available web resource of analytical technology services and products used in biomedical research, listing expertise and molecular resource capabilities available at research centres and biotech companies
Protein Database of the National Center for Biotechnology Information
database of biomedical nanotechnology research
Repository for nanosafety data.
Database of genome-scale metabolic networks.
Database of chemical and biological interactions
Collection of toxicogenomics data sets
worldwide crocodilian attack database
metabolomics database
Rett Syndrome Variation Database
curated and comprehensive summary of L1HS insertion polymorphisms identified in healthy or pathological human samples and published in peer-reviewed journals
DNA methylome programming database that integrates the genome-wide single-base nucleotide methylomes of gametes and early embryos in different model organisms
corpus of Russian internet texts that has been accessible on request through an online query interface since 2013
online Japanese-English dictionary
on-line knowledge resource on cell lines
biographical database of cultural industries in the Dutch and Flemish Golden Ages
biological database with maps of signaling and metabolic pathways
Database of metabolic fluxes
Database of apicomplexan metabolic pathways
Database of metabolomics data
biological database
open-access database storing curated, non-redundant transcription factor (TF) binding profiles
hierarchically structured, organism-independent, flexible and scalable controlled classification system enabling the functional description of proteins from any organism
free database of commercially-available compounds for virtual screening
German webportal for communication, media, and film studies
Dutch photographers' archive and image bank
institutional repository shared by Leeds, Sheffield and York Universities
image dataset
digital Medieval Latin library developed by the University of Zurich, Institute for Greek and Latin Philology
global demographic product created by the United States Census Bureau
comprehensive database for the fission yeast Schizosaccharomyces pombe, providing structural and functional annotation, literature curation and access to large-scale data sets
scientific database for bacteria
Indonesian music website; online music database; social music cataloging site
DIGITAALARHIIV: digital collection of Estonia
database; index of all those who worked in the English and Welsh book trades up to 1851
German database about philology
U.S. Department of Education database about all public schools, districts, and state education agencies in the United States
virtual authority list for ancient people through Linked Data collection
website and database on English poetry 1579-1930
digital archive of books, pamphlets, and periodical essays illustrating the causes and controversies that preoccupied Byron and his contemporaries
database
database on mtDNA data integrated with longevity records
academic project of University College London
Catalogue of Life in Taiwan
chemical database of the European Union for substances used as cosmetic ingredients
online database
public domain bibliographic database
Web application interface for viewing and editing microbial genetic data in Wikidata
database of biological enzymes
crowdsourced street-level photo database
prosopographical database of church musicians in France
bibliographic database
United States Army Corps of Engineers inventory of dams in the United States
national taxonomic reference of France
National-level gene storage biobank and data repository
chemical database
website for distributing open data
website of the World Chess Federation with Elo ratings of chess players
chemical database
database for drug discovery
biological database
art in urban space of the city of Zurich, Switzerland
proceeding series
American seismological website and database
public domain spectral chemical database
trilingual ontology consisting mainly of general concepts
online database about waterfalls
an open access, open source, community-driven web resource for Clinical Interpretation of Variants in Cancer
online database with information about digital art and artists
database with drug information
database of educational and research organizations
database of protein interaction data
biological pathway database
online service provided by United Nations Statistics Division (UNSD) of the Department of Economic and Social Affairs (DESA)
Ensembl database for accessing genome-scale data from plants.
database of monasteries, convents and collegiate churches in the Holy Roman Empire, created by the Germania Sacra research project
database with clinical trial data
database about cycling
database of libraries in the British Isles to 1850
deaths in California
Gene Ontology (GO) annotations database
artwork by Antoni Muntadas
tool for importing data sets into Wikidata
online flora of the panarctic region
an interdisciplinary team dedicated to annotating gene function related to human fetal development
database for information on protein localization, interaction, functional assays and expression
represents GO annotations created in 2001 for NCBI and extracted into UniProtKB-GOA from EntrezGene
A systems biology approach to dissect cilia function and its disruption in human genetic disease
online database of the legislative information of the United States Congress
Brazilian music website
online bibliographic database
support assertions about things (such as scientific conclusions, gene annotations, or other statements of fact) that result from scientific research
Spanish online library
database of cultural heritage objects serving the Wiki Loves Monuments project
website about ice hockey
international prospective register of systematic review protocols
Online archive of genetic and phenotypic human data
botanical online database
bibliographic database of the Mathematical Reviews journal
NCATS: Global Ingredient Archival System (GINAS)
database maintained by government of Victoria, Australia, containing official names of geographic features within the state
taxonomic wiki
cloud storage server compatible with Amazon S3
group of clay tablets from Iron Age Syria
web service allowing authors to register and claim authorship of their works
community database resource for the laboratory use of zebrafish
scientific database
dictionary of the Breton language
a collection of 570k human-written English sentence pairs, supporting the task of natural language inference (NLI)
database maintained by the Parliament of Finland
online resource that helps users discover biographical and historical information about persons, families, and organizations that created or are documented in historical resources (primary source documents) and their connections to one another
international database of classical philologists hosted by the Aristarchus project
data model used by GeoNames database
GO annotation database
registry for metadata schemas and application profiles
facility in Stockholm, Sweden
other organization in Munich, Germany
dataset
virtual library
Psychology online preprint service
Elizabeth Hawley's climbing statistics
Online library
lobid-organisations is a directory of approximately 30,000 memory institutions (libraries, archives and museums) in Germany, Austria, and Switzerland.
The North Rhine-Westphalian Library Service Centre's (hbz) union catalogue as Linked Open Data. The hbz union catalogue records approximately 20 million bibliographic titles plus holdings information. It contains cooperatively created title and holdings information from libraries in North Rhine-Westphalia and Rhineland-Palatinate.
an OBO Library ontology for environmental systems, components, and processes
dataset
A language database containing pronunciations. It is operated by, and available to employees within, the Swedish quango media.
art website/database
dataset developed by Mikolov et al. 2013
collection database owned by the Smithsonian Institution
question-answering dataset
question-answering dataset
question-answering dataset
question-answering dataset
database of Requests for Comments publications
database of medieval and Renaissance manuscripts in the British Library
registry of civil aircraft registration marks in Canada
database of airline histories
free-content digital library of Jewish texts
compilation of academic papers about wikis
biological database and online resource for integrating genotype and phenotype data
database of file format identification patterns compiled by Gary Kessler
dataset by Pang and Lee
register of Australian women and their organisations
dataset by Hu & Liu from KDD'04
dataset by Julian McAuley from article published in 2015
dataset by Socher et al. from 2013
dataset by Andrew Maas et al. from 2011 with 2 times 25,000 movie reviews
dataset by Wiebe et al. from 2005
database of saints, first names and feasts
architectural heritage database in Portugal
Variant Annotation as a Service
reference dataset for knowledge graph algorithms
benchmark dataset for speech recognition
dataset
bibliographic database of electronic PhD, MD and DProf theses provided by the British Library
music website focused on cover versions
knowledgebase for lipid biology
ontology to annotate experiments in the field of the life sciences
database which contains information on tissue and biofluid expression of extracellular RNAs
multilingual corpus
open archive of the social sciences
civil registry of births, marriages, and deaths in the state of Victoria, Australia
civil registry for Queensland, Australia
documents leak related to offshore investment
bibliographic database of the ACM
online database of Arthurian texts, images, and scholarship at the Robbins Library, University of Rochester, New York
describes the elements used in the data export of Semantic MediaWiki
database of botanical journals
registry that provides and maintains identifiers for genetic variants
classification for visual arts, encoding system for visual elements in artworks
dataset for question-answering
bibliography of medieval literature
word similarity dataset created by Felix Hill
dataset for word similarity
OpenStreetMap-based dataset, made available under the Open Database License
open initiative whose aim is to enrich the Web of Data with Spanish geospatial data
voice dataset by Mozilla
database of early modern correspondence
suite of OWL 2 DL ontology modules for describing aspects of semantic publishing and referencing
2018 version of ontology for describing entities that are or may be published
ontology that enables characterization of the nature or type of citations
ontology meant to define bibliographic records, bibliographic references, and their compilation into bibliographic collections and bibliographic lists
online database of the world's tallest buildings
image dataset
image dataset
open archive of Swedish National Heritage Board
online database of open access mandates
open access platform for digitized journals in Switzerland
image database with 2,429 faces size 19x19 in the training set
database of the burial grounds of the Commonwealth War Graves Commission
Brazilian national heritage register for cultural assets of artistic value
research database in consciousness research and neuroscience
data set consisting of 20,000 messages taken from 20 newsgroups
Internet database of movie theaters
Internet database of American bridges based on the National Bridge Inventory
dataset containing a collection of English paragraphs with over 3 billion words
linked data server designed for ontologies
database of pesticides evaluated by the Joint Meeting on Pesticide Residues
database of plants used as agricultural crops, including their ideal growing conditions
biological database
online database on amphibian declines, natural history, conservation, and taxonomy
database website
database
dataset for situation recognition
library
online dictionary of the Norwegian language
directory
online database of water resources in California, United States of America
digital repository
database for rare and/or genetic diseases
online repository of Closure packages
online database of the world's plants
free library of 19th century medical texts
Netherlands
multi-volume historical dictionary of English slang
repository for data from nuclear magnetic resonance spectroscopy on biomolecules
database of projects funded by German Research Foundation
database of World War II memorials in the Netherlands
website that aggregates reviews of music albums
Central Library of NTUA
information system for Estonian museums
biological database
database of digitized images from the New York Public Library's collections
Dutch online architecture database
open data platform of the Biblioteca Nacional de España
dataset
large dataset with labeled videos
database of the Hungarian Parliament
data archive for technical sciences
cancer registry for the state of Missouri, USA
database by SpringerNature
database with biographical details of Fellows of the Royal College of Surgeons
website of the U.S. Government Publishing Office offering access to U.S. government documents that will replace FDsys
taxonomic database
Mexican Public Cultural Sector database
digital library of the Biblioteca Europea di Informazione e Cultura Foundation, Milan
English dictionary hosted on merriam-webster.com
website
Track and Field Results Reporting System website database for NCAA U.S. collegiate track and cross country
database
dataset
dataset
the OER World Map is a collaboratively maintained database documenting the growing number of actors and activities in the field of open education worldwide
Russian online information system
online database of contemporary music
Harvard University's open-access digital repository of research
The Sol Genomics Network (SGN) is a database and website dedicated to the genomic information of the nightshade family, which includes species such as tomato, potato, pepper, petunia and eggplant.
online database of manga, anime, games and media art
dataset
online Finnish dictionary
bibliographic database run by Harvard University Library
website devoted to silent films
database of historical women from book genre collective biographies
gene expression database
online database
data repository
ontology
ontology
Ontology for enabling interoperability of epidemic models and public health application software.
ontology
ontology
ontology
ontology
An ontology for biodiversity data
ontology
ontology
ontology
ontology
ontology
ontology
ontology
ontology
ontology
ontology
Ontology to describe cell lines
an ontology for cell types
ontology
Ontology of small molecular entities of biological interest
Ontology for descriptors used in chemoinformatics database
ontology
ontology
ontology
ontology
ontology
ontology
Ontology of concepts and relations relevant to evolutionary comparative analysis
ontology
ontology
ontology
ontology
ontology
ontology
An ontology for types of evidence used to support scientific claims.
ontology
An ontology of emotions, moods, and other kinds of feelings.
ontology
ontology
ontology
ontology
An ontology of phenotypes of fission yeast.
ontology
ontology
ontology
An ontology for foodborne pathogens and associated outbreaks.
ontology
ontology
ontology
ontology
ontology
A set of ontologies related to infectious diseases.
An ontology of information entities.
ontology
An ontology for gene-gene interactions
ontology
ontology
ontology
An ontology for concepts related to malaria
ontology
An ontology for mental diseases
An ontology for concepts related to mental functioning
ontology
ontology
ontology
ontology
ontology
An ontology of mouse pathology phenotypes.
An ontology for organic reactions in organic synthesis.
ontology
ontology
ontology
ontology
ontology
An ontology for RNA function
ontology
An ontology for concepts related to biobanks
An ontology which describes biological processes, cellular components and molecular functions in living organisms
ontology
ontology
ontology
An ontology for general aspects of medicine, with a focus on cancer.
ontology
ontology
ontology
ontology
ontology
ontology
ontology
ontology
An ontology to describe biomedical statistics
ontology
ontology
ontology
ontology
An ontology for adverse events of vaccines.
Ontology Based Data Access refers to a range of semantic techniques, algorithms and systems developed to facilitate access to various types of data sources.
ontology
ontology
ontology
ontology
ontology
An ontology for describing groups of interacting organisms.
ontology
An ontology for the anatomy of sponges (Porifera)
ontology
ontology
ontology
ontology
ontology
ontology
ontology
An ontology for the provenance of scientific claims and supporting evidence.
ontology
ontology
ontology
An ontology to describe software applications with a focus on bioinformatics tools.
ontology
ontology
ontology
ontology
ontology
ontology
ontology
ontology
An ontology of clinical informed consents
An ontology for concepts related to vaccines and vaccination
ontology
Ontology of the anatomy of the African clawed frog (Xenopus laevis).
ontology
Bibliographic database of mostly German-language works on zoology.
Research Dataset
dataset
word analogy dataset
thematically and genre-balanced Polish language corpus with over 70 million words
Chemical Property Database
online search system for the United States Patent and Trademark Office's database of registered trademarks
free international research database for tertiary education
Norwegian road data bank
open access institutional repository that provides access to the scholarly, educational, and creative works of the US-based University of Maine community
prosopographical directory of French learned societies' membership
terminology database
2010 conceptual model and ontology for describing entities that are published
US data center for the global PDB archive
European data center for the global PDB archive
number database
Loeb Classical Library volume
The Automated Weather Data Network (AWDN) gathers weather data for partners in agriculture and related fields
chemistry indexing and abstracting service
A Taiwan Linked Open Data Platform (data.odw.tw) built by the Institute of Information Science, Academia Sinica, Taiwan
international disaster database, located at the Centre for Research on the Epidemiology of Disasters (CRED), Université catholique de Louvain, Brussels, Belgium
library catalogue shared between several universities in Southern Italy
biological database
online Maltese lexicon
database of material phase diagrams
database about soybean genetics
database about legume traits
database about peanut genetics
database about the genetics of corn (Zea mays)
database for comparative plant biology
online database of Sega video games
specializes in the archiving, cataloging, and distributing of scientific data sets relevant to asteroids, comets and interplanetary dust
national database of coronial information on every death reported to a coroner in Australia and New Zealand
question-answering dataset
repository https://publikationsserver.tu-braunschweig.de/
repository http://www.edshare.soton.ac.uk/
repository http://libres.uncg.edu/ir/
repository http://ddd.uab.cat/
repository http://digitalcommons.wayne.edu/
repository http://www.freidok.uni-freiburg.de/
system for automated soil mapping based on global soil profile and covariate data
data set
statistics database of INSEE
authorised whole-of-government website for Commonwealth of Australia legislation and related documents
Fire Effects Information System online database
online database of mosquitoes
digital atlas of coral reefs
web mapping service of species' distribution
database of clinical features seen in mitochondrial diseases
online database
full-text database for works published by Springer
former central database of United Kingdom citizens
Finding aid search interface
Computer database of antiquities in Jordan
database compiled and managed by Annita Lucchesi
Electronic Flora of South Australia
materials database covering historic and contemporary materials used in the production and conservation of art, architecture, and archaeology by Museum of Fine Arts, Boston, Massachusetts
online database of names and descriptions of geologic units
online archive of geoscience data in the United States
a project at Indiana University Bloomington to advance women's art made in Europe (and later, in the US) during the 15th-19th centuries
FDA database
4TU.ResearchData provides an archive for researchers around the world for long-term access and curation of research datasets, with a focus on data from science, engineering and technology.
USGS Numbered Series
pharmaceutical literature database
database containing citations of historic medical literature
serial published 1880 - 1961
bibliographic database
Free global database of active blockchain businesses
The European Criminal Records Information System is an EU-System for Criminal Records.
heritage institution
Butterflies of India online database
Indian online database on moths
Odonata of India online database
Reptiles of India online database
Birds of India online database
Moths of North America online database
an online publication on recent pollen
Plantarium online database
Paleozoic ammonoid online database system
Catalogue of the Lepidoptera of Belgium online database
Leeds Robotic Commands is a dataset of real-world RGB-D scenes of a robot manipulating different objects together with natural language descriptions of these actions.
online database
digital library
USGS Database
taxonomy of fossil plants database
bibliographic database
database
database of students of Litchfield Law School and Litchfield Female Academy
Biological database
biological database
biological database
online taxonomic database on Psylloidea
large-scale (1000 hours) corpus of read English speech
corpus made from audio talks and their transcriptions available on the TED website
website, online entertainment collectors database
Crowd-sourced Emotional Multimodal Actors Dataset
database of publicly funded research in the UK
biographical reference for British women writers
ontology
ontology
ontology
ontology
RDF representation of the Microsoft Academic Graph
American theater news website
database for Latin and Ancient Greek dictionaries
set of email addresses and passwords
online database of fleurons automatically derived from scanned public domain texts
thesaurus database of the Consortium of European Research Libraries
online database of Commodore Amiga video games
biographical reference work
online database
online database on scale insects
European database to search and record orphan works
online resource on the magic lantern, an early slide projector invented in the 17th century
dataset based on the MuseumFinland project
Kalevala as semantic web
metadata about Finnish fiction literature created by the Finnish Public Libraries
botanical index
heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study
gazetteer of 630k person and 42k org names that provides spelling variants and EMM news about the entity (200k news per day!)
The ontology provides a vocabulary for expressing facts about topological (ordering) relations among instants and intervals, together with information about durations, and about temporal position including date-time information.
dataset
portal to digital archives of Japanese culture and history
website related to records of Comédie-Française theatre troupe, 1680-1791
The National Library of Medicine's web site for consumer information about genetic conditions and the genes responsible for those conditions.
linguistic ontology, tree of the meaning of the Arabic terms
Italian database
4th edition of Annie Besant's English translation with parallel Sanskrit text
online bibliographic database of archaeology
US National Library of Medicine's digital archive of scientists, physicians, and others who have advanced science
taxonomic database for Antarctic marine species
database of digital collection of Laval University library
reference dataset for knowledge graph algorithms
database regarding the tradition of Greek texts before the 16th century
database management system for culture collections in the world
a reference knowledge graph (ontology) to interoperate data and for machine learning
bibliographic database of work-level records
digital library of Latin prose texts of Late Antiquity (2nd-6th century AD)
digital library of ancient Latin texts
Norway's nationwide civil registry
Ugo and Olga Levi Foundation institutional repository and database.
A curated database of gene-disease panels
biological database
the myschool.edu.au site, a government source of compiled data
database of stratigraphic units in Australia
database of biological specimen records
online database of music video information
online database
map and database of Australian First Languages
Statistics database hosted by the International Labour Organization
online directory of libraries located globally
website by the Australian Government about orphanages, children's Homes, and other institutions
website and database of flora in South Australia
digital library of Latin poetry
database connecting pathogens to phenotypes
electronic library
scientific database of beach measurements in New South Wales, Australia
online store and database focused on electroacoustic music
linked data platform
database of Australian patent information managed by IP Australia
database of food compositions operated by Food Standards Australia New Zealand
database of unique codes for all New South Wales offences and Commonwealth offences dealt with in New South Wales
JUSTfind - the online public access catalog of the university library Giessen
A social network of and about cinema
database of algal taxa in Australia
database of scientific names for species
database of Indian academicians and scientists
a semantically annotated English corpus
website
dataset
dataset
dataset
dataset
dataset
dataset
dataset
dataset
dataset
dataset
dataset
medical database
database of common names for natural history collections
dataset
dataset
dataset
health research database
database of French theatre of the seventeenth and eighteenth centuries
project of the British Library
digital repository for scholarly materials produced by members of the University of North Carolina at Chapel Hill community
repository http://amsdottorato.unibo.it/
repository http://memory.loc.gov/ammem/index.html
repository http://rruff.geo.arizona.edu/AMS/amcsd.php
repository http://repository.alt.ac.uk/
repository http://agritrop.cirad.fr/
repository http://ageconsearch.umn.edu/
repository http://ahero.uwc.ac.za/
repository http://preprints.acmac.uoc.gr/
repository http://gtcni.openrepository.com/gtcni/
repository https://aperto.unito.it/
repository http://archiviomarini.sp.unipi.it/
digitize and liberate all public domain sheet music
motorsport results and statistics database
theses database
Canadian movie and television news website and online database
portal published by the ISSN International Centre, containing ISSNs assigned to serial publications
An ontology for the diverse roles behind a scientific research article.
Norwegian portal for collecting business information
GBIF node in Finland
IAEA database of nuclear power plants worldwide
abandonware database
dataset
dataset
register of historic places in Washington, D.C.
online database of the Museum of Modern Art
annual directory of library publishers
digital repository of doctoral theses from the Consejo de Universidades
photographic archive at the University of Chicago
register of heritage-listed places in New Zealand
Online database of movie theaters, distributors, movies and screenings in the Netherlands
An Indonesian academic repository
database of films, actors, directors, etc., in Taiwan
digital library at the University of Patras
database of born-digital projects and resources for gender studies
site with detailed bibliographic info about an identity
clinical trials database
database
CORNELL NEWSROOM dataset
database
online flora for plants
a database of scholarly works, developed and maintained by MDPI (Q6715186). The name Scilit uses components of the words “scientific” and “literature”.
digital repository for archaeological research
speech database
website
dataset of bibliographic metadata
dictionary of the Aramaic language
bibliographic and full text database of agricultural information
biological database
online database
database of animal ageing and longevity
AI for object recognition in images and videos
chemical database of natural products
The LiverTox database in NCBI Bookshelf
website on amoeboid organisms
cybersecurity ontology
online database of video games
database maintained by the Ministry of Culture of Brazil
database of archives, libraries and museums of Regione Toscana
A publicly-accessible online system designed to facilitate the development, validation, curation and distribution of large-scale, evidence-based datasets for use in diagnostic variant filtering.
An ontology of disease symptoms
online map, marked-up texts, and descriptive gazetteer and encyclopedia of people, places, topics, and terms relating to London pre-1700
online database of video games
online database of video game music
genealogical database
system of registration of basic vital records such as birth, marriage and death
regional digital library in the United States, and service hub for DPLA
regional digital library in the United States, and service hub for DPLA
regional digital library in the United States, and service hub for DPLA
regional digital library in the United States, and service hub for DPLA
regional digital library in the United States, and service hub for DPLA
regional digital library in the United States, and service hub for DPLA
regional digital library in the United States, and service hub for DPLA
regional digital library in the United States, and service hub for DPLA
regional digital library in the United States, and service hub for DPLA
regional digital library in the United States, and service hub for DPLA
regional digital library in the United States, and service hub for DPLA
botanical database
knowledgebase provided by the WHO
Culture digital repository
digital archive of art and literature of antebellum New York
biological dataset
question-answering dataset
question-answering dataset
database
multiple monolingual text corpora
digital library of Latin poetry
database of union members
question-answering dataset
question-answering dataset
the first annotated corpus of Russian-language texts, developed since 1998, with complete morphological and syntactic tagging and disambiguated text
biological database
question-answering dataset
Mexican newspaper database
genome database
genome database
online database of Attic inscriptions
Historic England database of archaeological, architectural and maritime sites
digital library of South Dakota university collections
database of ancient authors and texts
national clinical trials registry
lyrics website
collaborative volunteer-run effort to track the COVID-19 outbreak in the United States
mobility report produced by Google
online database regarding Scholasticism
database of Spanish hospitals maintained by the ministry of health
Online directory of biobanks
database for historians using Wikibase
Pornographic film production company and distribution website
pornographic distribution platform
dataset for composition language and visual recognition
scientific bibliographic database
medical database of systematic reviews
free online platform for language lovers and an online community
Land Information New Zealand (LINZ) database for official place names/coordinates
database jointly compiled by the French School at Athens and the British School at Athens
database compiled by the German Archaeological Institute
website about Dutch language and Dutch literature by the National Library of the Netherlands
Lexical database of Basque
database by the Koninklijke Bibliotheek on the history of printed books
a list of publicly known cybersecurity vulnerabilities
database of Ecuadorian species
dataset of images
dataset of scenes
collection of over 100 up-to-date datasets relevant to California counties
Information database about captive and wild elephants
compendium of locales, maintained by the Center for Land Use Interpretation (Q5059738)
online taxonomic database
repository of digital items and collections
searchable repository of full text publications and citations by LSE staff
online archive of PhD theses for the London School of Economics and Political Science
Knowledge Graph containing historical photographs and metadata of Stuttgart State Theatres
source of accessibility information in the UK
computer vision dataset
distributor of documentary films in North America
Alexander Turnbull Library's catalogue for unpublished collections
database
ontology
Taxonomic database on Cephalopoda
biological database about glycans and glycoproteins
ZivaHub is the University of Cape Town's institutional open access data repository. It houses scholarly outputs of the University of Cape Town. Ziva is a Shona word meaning "to know".
digital single entry point service for all UNESCO resources
An electronic archive for digital resource materials in the fields of minority health and health disparities research and policy.
a digital repository in the USGS
Knowledge base used by Google which lets individuals create their profile on its search engine
publicly accessible database of vertebrate biodiversity data from natural history collections around the world
a research data repository in Taiwan
union catalog operated by Jisc
Historical financial database by Refinitiv
large
small
ComputerApplications_MISCELLANEOUS
DATE/TIME
File format
File name
File size
Uniform resource locator/link to file
80109 Pattern Recognition and Data Mining
170203 Knowledge Representation and Machine Learning
110906 Sensory Systems
79901 Agricultural Hydrology (Drainage, Flooding, Irrigation, Quality, etc.)
thermal Imaging
Molecular Biology
80699 Information Systems not elsewhere classified
signaling
Rattus norvegicus
59999 Environmental Sciences not elsewhere classified
acetylome
post-translational modification
Ecology
69999 Biological Sciences not elsewhere classified
host-pathogen interactions
Cancer
Science Policy
acetyltransferase
Toxoplasma gondii
magnesium
aluminium
ate complex
X-ray crystallography
Molecular Biology
80699 Information Systems not elsewhere classified
69999 Biological Sciences not elsewhere classified
Platycodon grandiflorum
Africa
Asia
Pliocene
Canarina canariensis
Ostrowskia magnifica
Cyclocodon lancifolium
Pleistocene
Canarina abyssinica
climate-driven extinction
continental islands
vicariance
nested phylogenetic dating
110309 Infectious Diseases
Canarina eminii
Canary Islands
Miocene
Cell Biology
long-distance dispersal
Pharmacology
Bayesian biogeography
Paleoecology
Uncategorised
Uncategorized
small
machine learning > classification
analysis > image processing
featured
technology and applied sciences > computing > computer science
data type > image data
medium
machine learning > deep learning
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
medium
featured
natural and physical sciences > nature > animals
human activities
natural and physical sciences > biology > ecology
natural and physical sciences > nature > plants
small
featured
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
general reference > research tools and topics > books
medium
15 Geothermal Energy
geothermal
Colorado
Routt County
Routt Hot Springs
Strawberry Park Hot Springs
reconnaissance
shallow temperature survey
air photo lineaments
groundwater
geochemistry
geology
geologic map
topographic map
geothermometry
map
small
Time Series Prediction
Statistics
Computational Biology
80301 Bioinformatics Software
60102 Bioinformatics
Ecology
Cancer
Pleistocene
community disassembly
Inorganic Chemistry
Biotechnology
Neuroscience
Developmental Biology
Rancholabrean
Holocene
Plant Biology
functional diversity
extinction
North America
megafauna
Biochemistry
60506 Virology
Mammalia
machine learning > classification
featured
data type > image data
medium
machine learning > deep learning
problem type > multiclass classification
culture and arts > games and toys
ComputingMilieux_COMPUTERSANDEDUCATION
GeneralLiterature_MISCELLANEOUS
Molecular Biology
Cell Biology
FIS distribution
gametic phase disequilibrium
29999 Physical Sciences not elsewhere classified
Markov chains
Cyclical parthenogenesis
de Finetti diagrams
Biophysics
Immunology
individual-based simulations
Health Care
Molecular Biology
69999 Biological Sciences not elsewhere classified
110309 Infectious Diseases
Pharmacology
Plant Biology
Immunology
endosymbiotic gene transfer
Microbiology
Medicine
eukaryotic phylogeny
tree of life
Computational Biology
Space Science
Genetics
Evolutionary Biology
sampling strategy
phylogenetics
microbial diversity
Uncategorised
Uncategorized
54 Environmental Sciences
ngee
ngee-arctic
barrow
alaska
Radiocarbon in CO2
Radiocarbon in soil
CO2 production
carbon mineralization
soil organic matter nitrogen concentration
soil organic matter carbon concentration
soil organic matter geochemistry
small
natural and physical sciences > nature > plants
natural and physical sciences > biology
ComputerApplications_COMPUTERSINOTHERSYSTEMS
technology and applied sciences > agriculture
medium
InformationSystems_INFORMATIONINTERFACESANDPRESENTATION(e.g., HCI)
ComputingMethodologies_PATTERNRECOGNITION
analysis > nlp
medium
featured
technology and applied sciences > computing > internet
technology and applied sciences > computing > internet > twitter
society and social sciences > society > politics
geography and places > asia > russia
society and social sciences > social sciences > international relations
60102 Bioinformatics
Supplementary materials
Geophysics
Treatment
Fugacity of carbon dioxide (water) at sea surface temperature (wet air)
Carbonate ion
Mass
Sample ID
Ammonium
Type
pH
standard error
Calculated using seacarb after Nisumaa et al. (2010)
Uniform resource locator/link to reference
Nitrate and Nitrite
Alkalinity, total
Salinity
Carbon, inorganic, dissolved
Temperature, water
Potentiometric
Carbonate system computation flag
Carbon dioxide
Registration number of species
Bicarbonate ion
Aragonite saturation state
Phosphate
Chlorophyll c
Potentiometric titration
Calcite saturation state
Chlorophyll a
South Pacific
Partial pressure of carbon dioxide (water) at sea surface temperature (wet air)
Experiment duration
Species
small
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
Computer Science::Digital Libraries
Biomarkers
Quantitative Biology::Genomics
Statistics::Applications
Statistics::Methodology
Computer Science::Computer Vision and Pattern Recognition
Astrophysics::Galaxy Astrophysics
Computer Engineering
130309 Learning Sciences
150201 Finance
small
natural and physical sciences > physical sciences > space
Astrophysics::Earth and Planetary Astrophysics
Physics::Space Physics
small
featured
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
health and fitness > self care > exercise > weight training
featured
medium
technology and applied sciences > computing > internet
culture and arts > culture and humanities > food and drink
machine learning > classification
society and social sciences > society > business
society and social sciences > social sciences > linguistics
small
featured
data science
terrorism
Worldwide
forecasting models
conflict
geography and places > cities
File name
File size
Uniform resource locator/link to file
File format
Pygoscelis adeliae
Aptenodytes forsteri
Antarctica
Thalassoica antarctica
Functional Ecology @ AWI (AWI_FuncEco)
Development of a CCAMLR Marine Protected Area in the Antarctic Weddell Sea (WSMPA)
File content
Weddell Sea
Marine Protected Area (MPA)
Neuroscience
170205 Neurocognitive Patterns and Neural Networks
small
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
technology and applied sciences > transport > cycling
health and fitness > self care > exercise > sports
health and fitness > self care > exercise
ComputerSystemsOrganization_MISCELLANEOUS
small
90399 Biomedical Engineering not elsewhere classified
90302 Biomechanical Engineering
locomotion analyses
Biological Engineering
110903 Central Nervous System
110999 Neurosciences not elsewhere classified
90399 Biomedical Engineering not elsewhere classified
69999 Biological Sciences not elsewhere classified
Science Policy
Microbiology
Space Science
80699 Information Systems not elsewhere classified
19999 Mathematical Sciences not elsewhere classified
Data Format
medium
featured
Astrophysics::Galaxy Astrophysics
natural and physical sciences > physical sciences > space
natural and physical sciences > physical sciences > astronomy
natural and physical sciences > physical sciences > physics
natural and physical sciences > nature
data type > image data
featured
skin and connective tissue diseases
large
problem type > binary classification
technology and applied sciences > medicine
machine learning
small
society and social sciences > society > finance
society and social sciences > society > money
80699 Information Systems not elsewhere classified
110309 Infectious Diseases
Inorganic Chemistry
Neuroscience
Developmental Biology
Computational Biology
Evolutionary Biology
Genetics
Molecular Biology
Physiology
Marine Biology
80107 Natural Language Processing
stance classification
stance detection
fake news
rustance
featured
medium
large
society and social sciences > society > crime
data type > bigquery
geography and places > north america > united states
society and social sciences > society > crime > violence
Neuroscience
80599 Distributed Computing not elsewhere classified
89999 Information and Computing Sciences not elsewhere classified
80107 Natural Language Processing
Distributed Computing
80705 Informetrics
10401 Applied Statistics
Mathematics::Category Theory
small
80699 Information Systems not elsewhere classified
110309 Infectious Diseases
Inorganic Chemistry
Biotechnology
Biophysics
Immunology
Genetics
Molecular Biology
Exome capture
small
population structure
Pseudotsuga menziesii
positive selection
Southwestern Germany
14 SOLAR ENERGY
NREL
energy
data
low income
low and moderate income
lmi
pv
rooftop
technical potential
solar for all
photovoltaic
solar
tract
prediction
USA
LiDAR
residential
2011-2015
demographic data
cost-benefit analysis
Cancer
Medicine
small
machine learning > classification
geography and places > asia > india
people and self > personal life > entertainment
ComputingMilieux_MISCELLANEOUS
featured
medium
ComputingMilieux_MISCELLANEOUS
culture and arts > culture and humanities > popular culture
culture and arts > performing arts > film
Molecular Biology
59999 Environmental Sciences not elsewhere classified
Microbiology
Computational Biology
Evolutionary Biology
sexual selection
social behaviour
fluids and secretions
integumentary system
inclusive fitness
aggression
Drosophila melanogaster
sexual conflict
kin selection
parasitic diseases
virus diseases
60102 Bioinformatics
Applied Computer Science
80106 Image Processing
Artificial Intelligence and Image Processing
80199 Artificial Intelligence and Image Processing not elsewhere classified
170299 Cognitive Science not elsewhere classified
170205 Neurocognitive Patterns and Neural Networks
179999 Psychology and Cognitive Sciences not elsewhere classified
small
featured
medium
health and fitness > self care > exercise > sports > horse racing
data type > image data
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
analysis > nlp
large
analysis > image processing
Paleoecology
100509 Video Communications
100599 Communications Technologies not elsewhere classified
small
ComputingMilieux_MISCELLANEOUS
Computational Biology
60102 Bioinformatics
Plant Biology
Data_FILES
60702 Plant Cell and Molecular Biology
Applied Computer Science
80106 Image Processing
Artificial Intelligence and Image Processing
80199 Artificial Intelligence and Image Processing not elsewhere classified
69999 Biological Sciences not elsewhere classified
Pharmacology
Developmental Biology
Biochemistry
Microbiology
Genetics
Evolutionary Biology
Molecular Biology
Physiology
Sceloporus occidentalis
Sceloporus zosteromus
Sceloporus cowlesi
Sceloporus hunsakeri
Sceloporus torquatus
Sceloporus smithi
Sceloporus bicanthalis
Sceloporus adleri
Sceloporus woodi
Sceloporus graciosus
Sceloporus gadoviae
Sceloporus horridus
Sceloporus tristichus
Comparative Biology
Sceloporus magister
Sceloporus ochoterenae
animal structures
Genomics/Proteomics
Sceloporus malachiticus
Sceloporus variabilis
Sceloporus taeniocnemis
Sceloporus edwardtaylori
Sceloporus grammicus
Sceloporus jalapae
Sceloporus orcutti
Reptiles
Sceloporus palaciosi
Sceloporus spinosus
Sceloporus siniferus
Sceloporus angustus
Sceloporus utiformis
Sceloporus formosus
Sceloporus carinatus
Sceloporus clarkii
Sceloporus mucronatus
Sceloporus olivaceus
Sceloporus exsul
Gene Structure and Function
Sceloporus scalaris
Sceloporus licki
small
featured
natural and physical sciences > nature > animals
initiatives > socrata
mathematics and logic > statistics > time series
Ecology
69999 Biological Sciences not elsewhere classified
110309 Infectious Diseases
Plant Biology
North America
Genetics
Evolutionary Biology
Inorganic Chemistry
phylogenetic
acoustic communication
Aves
Caribbean
acoustical environment
habitat use
foraging ecology
Europe
novel environments
sensory ecology
File name
File size
Uniform resource locator/link to file
File format
File content
Analytical method
ORDINAL NUMBER
Comment
small
Data_FILES
small
170299 Cognitive Science not elsewhere classified
Geology
Ecology
69999 Biological Sciences not elsewhere classified
Plant Biology
Evolutionary Biology
USA
Europe
Population Genetics - Empirical
Adaptation
Australia
Natural Selection and Contemporary Evolution
Agriculture
Raphanus raphanistrum
Ecological Genetics
69999 Biological Sciences not elsewhere classified
Plant Biology
Biochemistry
Genetics
80699 Information Systems not elsewhere classified
19999 Mathematical Sciences not elsewhere classified
Dolichonyx oryzivorus
landscape buffer
scale of effect
kernel
landscape context
Passerculus sandwichensis
spatial scale
Pterostichus melanarius
distance decay
Canada
landscape management
landscape structure
Habitat model
landscape extent
Pain
Questionnaire
Pain education
110399 Clinical Sciences not elsewhere classified
Neurophysiology
Measurement
Computer Engineering
Data_MISCELLANEOUS
91303 Autonomous Vehicles
Ecology
50202 Conservation and Biodiversity
50211 Wildlife and Habitat Management
80604 Database Management
Benchmarking
InformationSystems_DATABASEMANAGEMENT
SPARQL
Log Analysis
Triple Stores
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL
small
initiatives > socrata
small
small
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
ComputingMethodologies_PATTERNRECOGNITION
ComputingMilieux_MISCELLANEOUS
Data_GENERAL
Health Care
60301 Animal Systematics and Taxonomy
54 Environmental Sciences
ngee
ngee-arctic
respiration
methane production
methane oxidation
soil incubation
Snow thickness
Snow cover fraction
DATE/TIME
Digital camera CC640
80106 Image Processing
80103 Computer Graphics
ComputingMethodologies_PATTERNRECOGNITION
Bioinformatics
Data_MISCELLANEOUS
small
medium
ComputingMethodologies_PATTERNRECOGNITION
society and social sciences > society > business
Computer Science::Machine Learning
Computer Science::Computer Vision and Pattern Recognition
Computer Science::Sound
Statistics::Machine Learning
Computer Science::Neural and Evolutionary Computation
data type > image data
medium
featured
Computational Physics
20599 Optical Physics not elsewhere classified
algorithms > neural networks
relevance assessment
80704 Information Retrieval and Web Search
Neuroscience
Imaging
Genetics
Ecology
69999 Biological Sciences not elsewhere classified
Plant Biology
Genetics
Evolutionary Biology
80699 Information Systems not elsewhere classified
Deep-sea
Gulf of California
Xenoturbellida
Phylogenomics
Monterey Canyon
Paleoecology
111706 Epidemiology
small
ComputingMilieux_COMPUTERSANDEDUCATION
small
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
small
ComputingMethodologies_PATTERNRECOGNITION
machine learning > classification
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL
InformationSystems_INFORMATIONSYSTEMSAPPLICATIONS
ComputingMilieux_COMPUTERSANDSOCIETY
data type > text data
analysis > text mining
ComputingMilieux_LEGALASPECTSOFCOMPUTING
small
featured
society and social sciences > social sciences > sociology
people and self > personal life > love
69999 Biological Sciences not elsewhere classified
Cancer
Science Policy
110309 Infectious Diseases
Pharmacology
Evolutionary Biology
80699 Information Systems not elsewhere classified
Physiology
Marine Biology
Extracorporeal perfusion
free flap transplantation
rat model
microsurgery
membrane oxygenator
ECMO
tissue perfusion
39999 Chemical Sciences not elsewhere classified
small
40402 Geodynamics
InformationSystems_DATABASEMANAGEMENT
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > PRECIPITATION RATE
small
analysis > image processing
featured
data type > image data
technology and applied sciences > computing > internet
technology and applied sciences > computing > computer security
small
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
ComputingMethodologies_GENERAL
Ecology
69999 Biological Sciences not elsewhere classified
Science Policy
Evolutionary Biology
ComputingMethodologies_PATTERNRECOGNITION
80699 Information Systems not elsewhere classified
Data_MISCELLANEOUS
flower colour polymorphism
Mediterranean area
Iris pumila
pollen limitation
phenotypic selection
Iris lutescens
East Europe
69999 Biological Sciences not elsewhere classified
Science Policy
Neuroscience
Plant Biology
Immunology
Genetics
80699 Information Systems not elsewhere classified
Quantitative genetics and Mendelian inheritance
quantitative genetics
salmonid Subject area: Genomics and gene mapping
qPCR
quantitative trait loci
medium
medium
featured
GeneralLiterature_MISCELLANEOUS
culture and arts > arts and entertainment > humor
medium
featured
GeneralLiterature_MISCELLANEOUS
culture and arts > arts and entertainment > humor
small
110201 Cardiology (incl. Cardiovascular Diseases)
Uncategorized
69999 Biological Sciences not elsewhere classified
Science Policy
Genetics
ComputingMethodologies_PATTERNRECOGNITION
80699 Information Systems not elsewhere classified
Data_MISCELLANEOUS
analyses
example dataset
Triassic
Permian
small
small
ComputingMilieux_COMPUTERSANDEDUCATION
GeneralLiterature_MISCELLANEOUS
ComputingMethodologies_PATTERNRECOGNITION
InformationSystems_INFORMATIONSYSTEMSAPPLICATIONS
ComputingMilieux_MANAGEMENTOFCOMPUTINGANDINFORMATIONSYSTEMS
featured
medium
society and social sciences > society > finance
health care economics and organizations
natural and physical sciences > biology > health sciences > public health > healthcare
ComputingMethodologies_PATTERNRECOGNITION
Data_MISCELLANEOUS
Ecology
population density
mammals
reptiles
abundance
birds
amphibians
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
Computer Engineering
small
featured
medium
analysis > nlp
data type > text data
analysis > text mining
audience > beginner
analysis > data visualization
small
machine learning > classification
society and social sciences > society > finance
featured
small
society and social sciences > social sciences > sociology
society and social sciences > social sciences > linguistics
featured
data type > image data
medium
machine learning > deep learning
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
problem type > multiclass classification
people and self > personal life > clothing
culture and arts > visual arts > photography
Uncategorized
small
GeneralLiterature_MISCELLANEOUS
ComputingMilieux_LEGALASPECTSOFCOMPUTING
ComputingMilieux_THECOMPUTINGPROFESSION
small
medium
Ecology
50301 Carbon Sequestration Science
Environmental Science
Molecular Biology
Ecology
69999 Biological Sciences not elsewhere classified
110309 Infectious Diseases
Pharmacology
Inorganic Chemistry
Biochemistry
Genetics
Biodiversity
Natural resources policy
Sociology
Synergy and trade-off
French Alps
Multi-scale assessment
Ecosystem service association
Biophysical assessment
Landscape heterogeneity
Library and Information Studies
90399 Biomedical Engineering not elsewhere classified
80704 Information Retrieval and Web Search
query formulations
Molecular Biology
Evolutionary Biology
Asteraceae
60408 Genomics
polymorphisms
60309 Phylogeny and Comparative Analysis
sunflower family
duplication
probabilistic models
dysploidy
polyploidy
ancestral chromosome number
60409 Molecular Evolution
small
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
ComputingMilieux_THECOMPUTINGPROFESSION
Environmental Science
80704 Information Retrieval and Web Search
w3c
80505 Web Technologies (excl. Web Search)
datasets
descriptions
Linked Data
small
medium
featured
ComputingMethodologies_PATTERNRECOGNITION
large
small
Uncategorised
Uncategorized
small
featured
machine learning
technology and applied sciences > computing > computer security
society and social sciences > society > crime
general reference > reference works > web sites
Data_MISCELLANEOUS
ComputingMethodologies_PATTERNRECOGNITION
80612 Interorganisational Information Systems and Web Services
80699 Information Systems not elsewhere classified
69999 Biological Sciences not elsewhere classified
Developmental Biology
Genetics
Molecular Biology
39999 Chemical Sciences not elsewhere classified
Biochemistry
Climate Science
50204 Environmental Impact Assessment
50206 Environmental Monitoring
Hydrology
40107 Meteorology
Soil Science
general reference > research tools and topics > books
small
analysis > nlp
culture and arts > culture and humanities > languages
machine learning > recommender systems
Ecology
69999 Biological Sciences not elsewhere classified
Cancer
110309 Infectious Diseases
Immunology
Microbiology
Genetics
80699 Information Systems not elsewhere classified
Physiology
Marine Biology
Mixed-species bird flocks
Mutualisms
South Asia
Anthropogenic disturbance
Human-modified ecosystems
Biodiversity loss
Species interaction networks
Anthropocene
Fungi data
30101 Analytical Spectrometry
Spectroscopy data
small
people and self > personal life > entertainment
ComputingMilieux_MISCELLANEOUS
culture and arts > performing arts > film
80505 Web Technologies (excl. Web Search)
QA
Applied Computer Science
Semantic Web
Question Answering
80607 Information Engineering and Theory
Solar System
Solar Physics
Planets and Exoplanets
80104 Computer Vision
Neuroscience
Molecular Biology
59999 Environmental Sciences not elsewhere classified
Ecology
69999 Biological Sciences not elsewhere classified
Pliocene
Pleistocene
110309 Infectious Diseases
Miocene
Cell Biology
Biochemistry
60506 Virology
Genetics
Evolutionary Biology
Poecilia sphenops
Costa Rica
Oligocene
Poecilia mexicana limantouri
taxonomy
hybridization
species trees
Poecilia
cryptic species
Guatemala
coalescent
Panama
Poecilia orri
Poecilia sulphuraria
Poecilia catemaconis
Poecilia mexicana
incomplete lineage sorting
Poecilia butleri
El Salvador
Nicaragua
freshwater fishes
conservation
Mexico
Central America
Poecilia gillii
Poecilia hondurensis
general mixed Yule-coalescent (GMYC)
Honduras
non-adaptive radiations
Poecilia mexicana mexicana
Bayesian species delimitation
featured
medium
machine learning > deep learning
analysis > nlp
technology and applied sciences > computing > internet > twitter
society and social sciences > social sciences > linguistics
large
algorithms > neural networks
culture and arts > culture and humanities > languages
80106 Image Processing
80103 Computer Graphics
10401 Applied Statistics
ComputingMilieux_THECOMPUTINGPROFESSION
80309 Software Engineering
File size
Uniform resource locator/link to file
File name
small
featured
ComputingMethodologies_PATTERNRECOGNITION
society and social sciences > society > business
machine learning > regression analysis
Computer Engineering
small
featured
society and social sciences > society > business
algorithms > decision tree
people and self > personal life > employment
Applied Computer Science
Mathematics::Analysis of PDEs
Mathematics::Numerical Analysis
anomaly detection framework
time series analysis technique
Machine Learning Techniques
featured
large
health and fitness > self care > positive psychology > mental health
humanities
natural and physical sciences > biology > health sciences > public health
medium
medium
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
health and fitness > self care > exercise > sports
technology and applied sciences > computing > computer science
technology and applied sciences > computing > computer engineering
69999 Biological Sciences not elsewhere classified
Cancer
Science Policy
110309 Infectious Diseases
Biotechnology
Plant Biology
Biochemistry
80699 Information Systems not elsewhere classified
19999 Mathematical Sciences not elsewhere classified
39999 Chemical Sciences not elsewhere classified
9 configurations
Supplementary Data 7. Dataset
analysis
script
TNT
small
medium
small
ComputingMethodologies_PATTERNRECOGNITION
medium
91299 Materials Engineering not elsewhere classified
30307 Theory and Design of Materials
30306 Synthesis of Materials
Neuroscience
Imaging
10401 Applied Statistics
ComputingMilieux_THECOMPUTINGPROFESSION
80309 Software Engineering
featured
medium
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
society and social sciences > social sciences > linguistics
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL
culture and arts > culture and humanities > languages
technology and applied sciences > computing > artificial intelligence
ComputingMethodologies_ARTIFICIALINTELLIGENCE
50202 Conservation and Biodiversity
species abundance distributions
small
Molecular Biology
80699 Information Systems not elsewhere classified
69999 Biological Sciences not elsewhere classified
110309 Infectious Diseases
Cell Biology
Biochemistry
Biophysics
Microbiology
Genetics
Evolutionary Biology
39999 Chemical Sciences not elsewhere classified
analyses
sequence length
species identification efficiency
content
DNA barcoding gaps
GC
Dataset
Artificial Intelligence and Image Processing
Climate
Cloudbase
cloud base
Neuroscience
Data_MISCELLANEOUS
Imaging
Ecology
69999 Biological Sciences not elsewhere classified
Neuroscience
Genetics
Evolutionary Biology
80699 Information Systems not elsewhere classified
Adaptation
Population Genetics - Empirical
Insects
Conservation Genetics
Speciation
Uncategorised
Uncategorized
Molecular Biology
Ecology
North America
Biochemistry
Mammalia
Microbiology
Evolutionary Biology
19999 Mathematical Sciences not elsewhere classified
Price equation
Species selection
Palaeocene
Eocene
Macroevolution
Body size
Palaeocene/Eocene boundary
Wyoming
Bighorn Basin
Clarks Fork Basin
Neuroscience
170205 Neurocognitive Patterns and Neural Networks
179999 Psychology and Cognitive Sciences not elsewhere classified
Developmental and Educational Psychology
adult ages
Naturalistic Stimuli
170299 Cognitive Science not elsewhere classified
Teenagers
170102 Developmental Psychology and Ageing
Development Characteristics
170112 Sensory Processes
Perception and Performance
Movies
Human Brain Activation
fMRI image analysis approach
Behavioral Neuroscience
Neuroscience and Physiological Psychology
humanities
IUGR dataset
intrauterine growth restriction
low birth weight
preterm birth
preterm labor
premature rupture of membranes
prenatal care
140301 Cross-Sectional Analysis
Climate
WAGHC
observational data
ocean
Molecular Biology
80699 Information Systems not elsewhere classified
69999 Biological Sciences not elsewhere classified
110309 Infectious Diseases
Biochemistry
Microbiology
Computational Biology
Genetics
Evolutionary Biology
19999 Mathematical Sciences not elsewhere classified
Gene Structure and Function
Population Genetics - Empirical
Conservation Genetics
Host Parasite Interactions
Hybridization
Quebec
Salvelinus fontinalis
Parasitology
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
80106 Image Processing
Earth Observation Satellites > LANDSAT > LANDSAT-8
EARTH SCIENCE > CRYOSPHERE > GLACIERS/ICE SHEETS > GLACIERS
Landsat 8
Debris-covered glaciers
Remote sensing
EARTH SCIENCE SERVICES > MODELS > CRYOSPHERE MODELS
Earth Observation Satellites > Sentinel GMES > SENTINEL-2
Sentinel-2A/B
Cell Biology
FIS distribution
gametic phase disequilibrium
29999 Physical Sciences not elsewhere classified
Markov chains
Cyclical parthenogenesis
de Finetti diagrams
Immunology
individual-based simulations
Genetics
19999 Mathematical Sciences not elsewhere classified
education
behavioral disciplines and activities
80301 Bioinformatics Software
80107 Natural Language Processing
80607 Information Engineering and Theory
Computation Theory and Mathematics
small
featured
ComputingMethodologies_PATTERNRECOGNITION
Data_MISCELLANEOUS
data type > tabular data
featured
culture and arts > culture and humanities > food and drink
small
digestive, oral, and skin physiology
small
GeneralLiterature_MISCELLANEOUS
Paleoecology
gridded data
EARTH SCIENCE > AGRICULTURE > AGRICULTURAL PLANT SCIENCE > CROPPING SYSTEMS
soil management
plowing
ploughing
tillage
EARTH SCIENCE > AGRICULTURE > SOILS
Conservation Agriculture
small
mathematics and logic > statistics > time series
audience > beginner
analysis > time series analysis
machine learning > forecasting
small
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
technology and applied sciences > computing > internet
ComputingMilieux_MISCELLANEOUS
Database
featured
data type > image data
medium
problem type > multiclass classification
Image Processing
culture and arts > culture and humanities > food and drink
Data Analysis
Fresh Fruits
Convolutional Neural Network
179999 Psychology and Cognitive Sciences not elsewhere classified
education
Applied Psychology
Helmholtz-Verbund Regionale Klimaänderungen = Helmholtz Climate Initiative (Regional Climate Change) (REKLIM)
Paleo Modelling (PalMod)
Ecology
Cell Biology
Biochemistry
Microbiology
assembly motif
functional effect groups
theoretical ecology
clustering
community
modelling
combinatorics
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
small
small
featured
medium
geography and places > asia > india
health and fitness > self care > positive psychology > mental health
natural and physical sciences > nature > death
society and social sciences > society > health
medium
health and fitness > self care > exercise > sports > horse racing
Horse Racing
Statistics
small
featured
medium
culture and arts > visual arts
culture and arts > arts and entertainment > museums
culture and arts > arts and entertainment
culture and arts > culture and humanities
DATE/TIME
Randolph Glacier Inventory 6.0
glacier ID
Flag
Elevation of event
Snow density
uncertainty
Longitude of event
Event label
Number
Snow water equivalent
DEPTH
ice/snow
Latitude of event
Density
snow
Snow depth
59999 Environmental Sciences not elsewhere classified
69999 Biological Sciences not elsewhere classified
Cancer
110309 Infectious Diseases
Biochemistry
Immunology
80699 Information Systems not elsewhere classified
39999 Chemical Sciences not elsewhere classified
Parasitology
Alveolata
Polychromophilus
Plasmodium
Haemoproteus
Parahaemoproteus
Malaria
Leucocytozoon
Apicomplexa
Haemosporida
File name
File size
Uniform resource locator/link to file
File format
Molecular Biology
59999 Environmental Sciences not elsewhere classified
69999 Biological Sciences not elsewhere classified
Science Policy
Inorganic Chemistry
Developmental Biology
Plant Biology
60506 Virology
Medicine
Computational Biology
Evolutionary Biology
12 days
MS
column.unigene.fasta files
day.unigene.fasta
transcriptome assembly
7 days
4 days
fb.flower bud.Unigene.fa
equestri
L 5.root
L 6.stem
Uncategorized
journal
Uncategorised
Uncategorized
Diseases
80699 Information Systems not elsewhere classified
69999 Biological Sciences not elsewhere classified
110309 Infectious Diseases
Developmental Biology
60506 Virology
Microbiology
Genetics
Evolutionary Biology
Molecular Biology
Biochemistry
small
featured
society and social sciences > social sciences > linguistics
Computational Biology
Neuroscience
Diseases
Uncategorised
Uncategorized
Hydrology
Soil Science
60102 Bioinformatics
Cancer
otorhinolaryngologic diseases
small
education
small
Data Format
Computational Biology
Bioinformatics
60408 Genomics
60405 Gene Expression (incl. Microarray and other genome-wide approaches)
30307 Theory and Design of Materials
30306 Synthesis of Materials
91299 Materials Engineering not elsewhere classified
Process Design
Conventional Power Plants
80107 Natural Language Processing
Cell Biology
small
featured
analysis > nlp
technology and applied sciences > computing > internet
technology and applied sciences > computing > internet > twitter
society and social sciences > social sciences > linguistics
InformationSystems_MISCELLANEOUS
society and social sciences > social sciences
small
featured
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
ComputingMilieux_MISCELLANEOUS
ComputingMilieux_COMPUTERSANDSOCIETY
ComputingMilieux_LEGALASPECTSOFCOMPUTING
ComputingMilieux_THECOMPUTINGPROFESSION
society and social sciences > society > crime
featured
ComputingMilieux_COMPUTERSANDEDUCATION
large
society and social sciences > society > education
medium
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
natural and physical sciences > biology > health sciences > public health > healthcare
ComputingMethodologies_PATTERNRECOGNITION
small
health and fitness > self care > exercise > sports > basketball
small
ComputingMethodologies_PATTERNRECOGNITION
technology and applied sciences > computing > internet
society and social sciences > social sciences > linguistics
Data_MISCELLANEOUS
Inertial measurement unit
multimodal sensor input
90602 Control Systems, Robotics and Automation
grasps
depth imaging
RGB images
90305 Rehabilitation Engineering
Grasping Activities
featured
small
technology and applied sciences > computing > programming
TheoryofComputation_LOGICSANDMEANINGSOFPROGRAMS
ComputingMilieux_PERSONALCOMPUTING
Software_SOFTWAREENGINEERING
80301 Bioinformatics Software
Applied Computer Science
Computer Software
80105 Expert Systems
80702 Health Informatics
80108 Neural, Evolutionary and Fuzzy Computation
80109 Pattern Recognition and Data Mining
featured
medium
society and social sciences > society > finance
medium
machine learning > deep learning
natural and physical sciences > nature > animals
Muinchille
Drung
Cootehill
Drong
80705 Informetrics
chi index h-index
global surface temperature
land surface air temperature
Sea surface temperature
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW DEPTH
EARTH SCIENCE > LAND SURFACE > TOPOGRAPHY > SURFACE ROUGHNESS
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > PRECIPITATION AMOUNT
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > VERTICAL WIND VELOCITY/SPEED
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > UPPER LEVEL WINDS > U/V WIND COMPONENTS
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > SURFACE WINDS > U/V WIND COMPONENTS
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > HUMIDITY > SPECIFIC HUMIDITY
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > HUMIDITY > RELATIVE HUMIDITY
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > UPPER AIR TEMPERATURE
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > SURFACE TEMPERATURE > AIR TEMPERATURE
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC PRESSURE > HYDROSTATIC PRESSURE
EARTH SCIENCE > ATMOSPHERE > ALTITUDE > GEOPOTENTIAL HEIGHT
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > RADIATIVE FLUX
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC PRESSURE > SEA LEVEL PRESSURE
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > SURFACE TEMPERATURE > MAXIMUM/MINIMUM TEMPERATURE
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR PROCESSES > SUBLIMATION
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > PRECIPITATION PROFILES > LATENT HEAT FLUX
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > WIND SHEAR > VERTICAL WIND SHEAR
EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD PROPERTIES > CLOUD BASE HEIGHT
EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD PROPERTIES > CLOUD FRACTION
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > ACCUMULATIVE CONVECTIVE PRECIPITATION
EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD MICROPHYSICS > CLOUD PRECIPITABLE WATER
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR PROCESSES > EVAPOTRANSPIRATION
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > SOLID PRECIPITATION > ICE PELLETS
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > LIQUID PRECIPITATION > RAIN > FREEZING RAIN
EARTH SCIENCE > CLIMATE INDICATORS > CRYOSPHERIC INDICATORS > ICE DEPTH/THICKNESS
EARTH SCIENCE > CLIMATE INDICATORS > CRYOSPHERIC INDICATORS > SNOW COVER
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > SOLID PRECIPITATION > SNOW
EARTH SCIENCE > ATMOSPHERE > WEATHER EVENTS > Stability/Severe Weather Indices > CONVECTIVE AVAILABLE POTENTIAL ENERGY (CAPE)
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > LIQUID PRECIPITATION > RAIN
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SURFACE WATER > SURFACE WATER PROCESSES/MEASUREMENTS > RUNOFF
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > PRECIPITATION RATE
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > SNOW WATER EQUIVALENT
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC PRESSURE > PLANETARY BOUNDARY LAYER HEIGHT
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > SURFACE TEMPERATURE > POTENTIAL TEMPERATURE
EARTH SCIENCE > CRYOSPHERE > SEA ICE > HEAT FLUX
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC CHEMISTRY > OXYGEN COMPOUNDS > OZONE
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > VORTICITY > POTENTIAL VORTICITY
EARTH SCIENCE > BIOSPHERE > VEGETATION > VEGETATION COVER
Paleoecology
small
featured
society and social sciences > social sciences > linguistics
Sociology
culture and arts > culture and humanities > languages
Politics
Science of education
Education
The Netherlands
Behavioural sciences
Socio-cultural sciences
society and social sciences > social sciences > demographics
geography and places > world
Health sciences
Psychology
Demography
Temporal coverage: 2012 December
Society and social systems
Social attitudes and values
Leisure and recreation studies
Health and well-being
Demography and population
Leisure, recreation and culture
Housing and household
Social behavior
Religion
Environment
Social sciences
analysis > survey analysis
Ecology
Quantitative Biology::Populations and Evolution
medium
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
ComputingMethodologies_PATTERNRECOGNITION
ComputerSystemsOrganization_SPECIAL-PURPOSEANDAPPLICATION-BASEDSYSTEMS
featured
medium
society and social sciences > society > crime
geography and places > north america > united states
society and social sciences > society > crime > violence
mathematics and logic > statistics > time series
natural and physical sciences > earth sciences > geography
society and social sciences > society > crime > illegal drugs
Data_MISCELLANEOUS
ComputingMethodologies_PATTERNRECOGNITION
Data Format
Yelp
80505 Web Technologies (excl. Web Search)
Computer Software
Data Format
80404 Markup Languages
80306 Open Software
80602 Computer-Human Interaction
Library and Information Studies
Crystallography
Neuroscience
Imaging
59999 Environmental Sciences not elsewhere classified
69999 Biological Sciences not elsewhere classified
Space Science
Genetics
Evolutionary Biology
80699 Information Systems not elsewhere classified
energy limitation
fungi
density compensation
species interaction
Neotropics
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > UPPER AIR TEMPERATURE
EARTH SCIENCE > ATMOSPHERE > ALTITUDE > GEOPOTENTIAL HEIGHT
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > UPPER LEVEL WINDS
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > SURFACE WINDS
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > DEW POINT TEMPERATURE
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC PRESSURE > ATMOSPHERIC PRESSURE MEASUREMENTS
EARTH SCIENCE > Atmosphere > Atmospheric Pressure > Sea Level Pressure
EARTH SCIENCE > Atmosphere > Atmospheric Pressure > Surface Pressure
EARTH SCIENCE > Oceans > Ocean Pressure > Sea Level Pressure
Dependability
Reliability
80501 Distributed and Grid Systems
Anomaly detection
Logs
Log analysis
comets
File format
File name
File size
Uniform resource locator/link to file
Arctic Ocean
Dynamic Ocean Topography
Geostrophic Currents
Ocean Modeling
Principal Component Analysis
Satellite altimetry
Variations in ocean currents, sea ice concentration, and sea surface temperature along the North-East coast of Greenland (NEG-OCEAN)
Ecology
Cell Biology
Developmental Biology
60506 Virology
Medicine
Evolutionary Biology
Marine Biology
Adaptation
Speciation
Silene dioica
present time
Switzerland
Silene latifolia
Hematology
reproductive barrier
Economics
accounting network
banks
graphml
balancesheets
communities
Ecology
69999 Biological Sciences not elsewhere classified
Science Policy
110309 Infectious Diseases
19999 Mathematical Sciences not elsewhere classified
generation time
pre-breeding condition
Mus domesticus
overdominance
t haplotype
intergenerational costs
intermittent breeding
Perisoreus infaustus
reproductive costs
life-history
intragenomic conflict
Palearctic
intragenerational costs
t frequency paradox
"BIOLOGICAL CLASSIFICATION"
"ANIMALS/INVERTEBRATES"
"MOLLUSKS"
Biomineralization
Shell Matrix Proteins
SMPs
Shell formation
featured
culture and arts > culture and humanities > food and drink
large
Cancer
Paleoecology
Diet and health
Medical science and disease
ComputingMethodologies_PATTERNRECOGNITION
Data_MISCELLANEOUS
90302 Biomechanical Engineering
Innervation zone
Image-based clustering
Graph-Cut segmentation
electromyography
Environmental Science
small
GeneralLiterature_MISCELLANEOUS
small
small
featured
natural and physical sciences > physical sciences > space
Astrophysics::Earth and Planetary Astrophysics
Physics::Space Physics
natural and physical sciences > physical sciences > astronomy
Physics::Geophysics
Astrophysics::Solar and Stellar Astrophysics
80104 Computer Vision
170299 Cognitive Science not elsewhere classified
Genetics
Data_FILES
60205 Marine and Estuarine Ecology (incl. Marine Ichthyology)
Biological Techniques
Zoology
Paleoecology
medium
featured
ComputingMethodologies_PATTERNRECOGNITION
technology and applied sciences > computing > internet
society and social sciences > social sciences > linguistics
small
small
technology and applied sciences > medicine
natural and physical sciences > biology > health sciences > public health > healthcare > surgery
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
90602 Control Systems, Robotics and Automation
80104 Computer Vision
Molecular Biology
80699 Information Systems not elsewhere classified
69999 Biological Sciences not elsewhere classified
Cancer
110309 Infectious Diseases
Evolutionary Biology
gene flow
asymmetric introgression
ABC
demographic history
newt
Lissotriton
Since Miocene
Central Europe
Molecular Biology
80101 Adaptive Agents and Intelligent Robotics
69999 Biological Sciences not elsewhere classified
Cancer
110309 Infectious Diseases
Cell Biology
Plant Biology
Biochemistry
60506 Virology
Medicine
Computational Biology
Genetics
Hematology
Root architecture and plasticity
Sinorhizobium meliloti
Rhizobia responses
Nitrogen responses
Cell identity
Plant responses to the environment
Root cell types
Arabidopsis thaliana
Fluorescence Activated Cell Sorting
Molecular Biology
80699 Information Systems not elsewhere classified
59999 Environmental Sciences not elsewhere classified
Ecology
69999 Biological Sciences not elsewhere classified
Cancer
Science Policy
Developmental Biology
Immunology
Computational Biology
Macroevolution
Hematology
Directional Evolution
Mollusca
geomorph
Geometric Morphometrics
Pectinidae
59999 Environmental Sciences not elsewhere classified
Ecology
69999 Biological Sciences not elsewhere classified
Cancer
Science Policy
Evolutionary Biology
Physiology
humanities
education
Galliformes
Body mass
Anatidae
urologic and male genital diseases
Birds
Anseriformes
Galloanserae
Herbivory
Diet
small
featured
human activities
health care economics and organizations
natural and physical sciences > biology > health sciences > public health > healthcare
equipment and supplies
health services administration
population characteristics
small
GeneralLiterature_MISCELLANEOUS
ComputingMethodologies_PATTERNRECOGNITION
society and social sciences > society > business
health and fitness > self care > exercise > sports
ComputingMilieux_PERSONALCOMPUTING
featured
society and social sciences > society > politics
society and social sciences > social sciences > linguistics
small
technology and applied sciences > computing > internet
medium
small
audience > beginner
analysis > data cleaning
algorithms > linear regression
people and self > personal life > housing
society and social sciences > society > real estate
featured
small
geography and places > asia > india
natural and physical sciences > biology > health sciences > public health > healthcare
natural and physical sciences > biology > health sciences
medium
large
Data_FILES
general reference > research tools and topics > databases
Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION
110999 Neurosciences not elsewhere classified
Microbiology
110303 Clinical Microbiology
110307 Gastroenterology and Hepatology
110307 Gastroenterology and Hepatology
File format
File size
Uniform resource locator/link to file
File name
small
medium
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
ComputingMethodologies_PATTERNRECOGNITION
ComputingMilieux_MISCELLANEOUS
InformationSystems_DATABASEMANAGEMENT
Data_MISCELLANEOUS
Hydrology
80110 Simulation and Modelling
40105 Climatology (excl. Climate Change Processes)
90509 Water Resources Engineering
40608 Surfacewater Hydrology
40604 Natural Hazards
Climate Science
small
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
medium
medium
featured
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
general reference > research tools and topics > knowledge
general reference > reference works > encyclopedias
Hydrology
80110 Simulation and Modelling
40105 Climatology (excl. Climate Change Processes)
40608 Surfacewater Hydrology
Oceanography
90905 Photogrammetry and Remote Sensing
40104 Climate Change Processes
90902 Geodesy
ComputingMethodologies_PATTERNRECOGNITION
Neuroscience
Computer Science
Linguistics
120507 Urban Analysis and Development
Computational Biology
Data_FILES
Biological Techniques
small
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
small
analysis > nlp
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
data type > text data
featured
medium
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
society and social sciences > society > business
EARTH SCIENCE > LAND SURFACE > EROSION/SEDIMENTATION > EROSION
EARTH SCIENCE > PALEOCLIMATE > PALEOCLIMATE RECONSTRUCTIONS > DROUGHT/PRECIPITATION RECONSTRUCTION
EARTH SCIENCE > SOLID EARTH > GEOCHEMISTRY > GEOCHEMICAL PROPERTIES > ISOTOPE RATIOS
Mapping tool
EARTH SCIENCE > PALEOCLIMATE > OCEAN/LAKE RECORDS > ISOTOPES
EARTH SCIENCE > LAND SURFACE > EROSION/SEDIMENTATION > SEDIMENT TRANSPORT
Neodymium radioisotopes
EARTH SCIENCE SERVICES > DATA ANALYSIS AND VISUALIZATION > STATISTICAL APPLICATIONS
EARTH SCIENCE > PALEOCLIMATE > LAND RECORDS > SEDIMENTS
Strontium radioisotopes
nanozyme assay
silver nanoparticles
catalytic activity
SERRS
small
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
society and social sciences > social sciences > linguistics
60302 Biogeography and Phylogeography
Uncategorized
featured
culture and arts > culture and humanities > food and drink
small
digestive, oral, and skin physiology
food and beverages
small
Physics::Instrumentation and Detectors
Nuclear Experiment
Computer Science::Mathematical Software
79901 Agricultural Hydrology (Drainage, Flooding, Irrigation, Quality, etc.)
Molecular Biology
80699 Information Systems not elsewhere classified
69999 Biological Sciences not elsewhere classified
Cancer
110309 Infectious Diseases
Inorganic Chemistry
Neuroscience
Developmental Biology
Plant Biology
60506 Virology
Marine Biology
Data_FILES
Ecological Genetics
Hydrology
Conservation Genetics
140201 Agricultural Economics
Population Genetics - Theoretical
Agricultural economics
70104 Agricultural Spatial Analysis and Modelling
Climate Change Impact
Population Ecology
59999 Environmental Sciences not elsewhere classified
Ecology
69999 Biological Sciences not elsewhere classified
Plant Biology
functional diversity
USA
alpha diversity
Illinois
temporal dynamics
beta diversity
fire management
ecological restoration
Midwest USA
environmental filtering
Chicago
community assembly
60309 Phylogeny and Comparative Analysis
Data_FILES
50206 Environmental Monitoring
small
Computer Engineering
r language
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
ComputingMethodologies_PATTERNRECOGNITION
Computer Science::Computer Vision and Pattern Recognition
Astrophysics::Galaxy Astrophysics
Computer Engineering
130309 Learning Sciences
Data_MISCELLANEOUS
Computer Science::Multimedia
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
Neuroscience
Data_MISCELLANEOUS
featured
medium
small
health and fitness > self care > exercise > sports > american football
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL
InformationSystems_GENERAL
small
data type > image data
Computer Science::Computer Vision and Pattern Recognition
large
Mathematics::Geometric Topology
small
ComputingMilieux_GENERAL
170299 Cognitive Science not elsewhere classified
small
Cancer
110309 Infectious Diseases
Cell Biology
Developmental Biology
Plant Biology
60506 Virology
Immunology
Microbiology
Computational Biology
Genetics
39999 Chemical Sciences not elsewhere classified
Hematology
Root architecture and plasticity
Sinorhizobium meliloti
Rhizobia responses
Nitrogen responses
Cell identity
Plant responses to the environment
Root cell types
Arabidopsis thaliana
Fluorescence Activated Cell Sorting
small
society and social sciences > society > finance
philosophy and thinking > philosophy > history
80309 Software Engineering
Molecular Biology
80699 Information Systems not elsewhere classified
69999 Biological Sciences not elsewhere classified
Genetics
Evolutionary Biology
Palaeoptera
Bayesian phylogenetics
BEAST
Pterygota
Metapterygota
Chiastomyaria
featured
data type > image data
medium
natural and physical sciences > biology > health sciences > public health > healthcare
respiratory system
respiratory tract diseases
large
69999 Biological Sciences not elsewhere classified
Cancer
Science Policy
Pharmacology
Inorganic Chemistry
Biotechnology
Medicine
Genetics
80699 Information Systems not elsewhere classified
19999 Mathematical Sciences not elsewhere classified
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SURFACE WATER > DISCHARGE/FLOW
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SURFACE WATER > RIVERS/STREAMS
In Situ/Laboratory Instruments > Conductivity Sensors > CONDUCTIVITY METERS
In Situ/Laboratory Instruments > Gauges > STREAM GAUGES
Catchments as Organised Systems
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SURFACE WATER > HYDROPATTERN
CAOS
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SURFACE WATER > DRAINAGE
In Situ/Laboratory Instruments > Photon/Optical Detectors > Cameras > CAMERA
70103 Agricultural Production Systems Simulation
small
Artificial Intelligence and Image Processing
audience > beginner
analysis > data visualization
weka
Big Data
Uncategorised
Uncategorized
small
problem type > multiclass classification
machine learning > classification
problem type > binary classification
algorithms > xgboost
machine learning > model comparison
machine learning > feature engineering
society and social sciences > society > finance > banking
algorithms > svm
algorithms > logistic regression
59999 Environmental Sciences not elsewhere classified
Ecology
Pharmacology
Inorganic Chemistry
Biochemistry
60506 Virology
Immunology
Medicine
Computational Biology
Space Science
Genetics
Evolutionary Biology
fungi
food and beverages
genetic processes
animal diseases
featured
ComputingMilieux_COMPUTERSANDEDUCATION
large
society and social sciences > society > education
small
medium
120104 Architectural Science and Technology (incl. Acoustics, Lighting, Structure and Ecologically Sustainable Design)
small
featured
society and social sciences > society > finance
mathematics and logic > statistics > time series
natural and physical sciences > biology > health sciences > public health > healthcare
natural and physical sciences > biology > health sciences > public health
society and social sciences > social sciences > demographics
society and social sciences > society > government
general reference > research tools and topics > government agencies
small
small
small
ComputingMethodologies_PATTERNRECOGNITION
medium
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
culture and arts > visual arts > photography
technology and applied sciences > computing > human-computer interaction
technology and applied sciences > electronics > digital media
medium
stomatognathic diseases
small
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > PRECIPITATION AMOUNT
featured
large
health and fitness > self care > positive psychology > mental health
humanities
natural and physical sciences > biology > health sciences > public health
Data Format
80499 Data Format not elsewhere classified
80403 Data Structures
160511 Research, Science and Technology Policy
100504 Data Communications
91305 Energy Generation, Conversion and Storage Engineering
160808 Sociology and Social Studies of Science and Technology
80703 Human Information Behaviour
80699 Information Systems not elsewhere classified
69999 Biological Sciences not elsewhere classified
Cancer
Biochemistry
19999 Mathematical Sciences not elsewhere classified
Dataset
110309 Infectious Diseases
Supplemental
59999 Environmental Sciences not elsewhere classified
69999 Biological Sciences not elsewhere classified
Cancer
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
Biotechnology
Biochemistry
Genetics
80699 Information Systems not elsewhere classified
19999 Mathematical Sciences not elsewhere classified
reductionism
mechanistic simulation
bibliographic network
science integration
systems science
individual-based model
80106 Image Processing
80103 Computer Graphics
Plant Biology
ComputingMethodologies_PATTERNRECOGNITION
Data_MISCELLANEOUS
bioenergy crop plantings
70304 Crop and Pasture Biomass and Bioproducts
biomass C
70306 Crop and Pasture Nutrition
small
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
80106 Image Processing
Fourier Optics
Image and Signal Processing
Law
130306 Educational Technology and Computing
data protection
privacy
150306 Industrial Relations
employee surveillance
lecture capture
small
Environmental Science
69999 Biological Sciences not elsewhere classified
110309 Infectious Diseases
Cell Biology
60506 Virology
Genetics
Evolutionary Biology
80699 Information Systems not elsewhere classified
kinship
inbreeding
temperate rainforests
Conservation genetics and biodiversity
South America
conservation genetics
habitat fragmentation
Molecular Biology
80699 Information Systems not elsewhere classified
Ecology
69999 Biological Sciences not elsewhere classified
110309 Infectious Diseases
Inorganic Chemistry
Computational Biology
Genetics
Evolutionary Biology
ComputingMethodologies_PATTERNRECOGNITION
Data_MISCELLANEOUS
coalescence
RAD-Seq
speciation
Genotyping-by-sequencing
Western Mediterranean
concatenation
Linaria
radiation
Quaternary
Iberian Peninsula
phylogeny
111706 Epidemiology
119999 Medical and Health Sciences not elsewhere classified
Health Care
small
featured
culture and arts > culture and humanities > food and drink
Ecology
60205 Marine and Estuarine Ecology (incl. Marine Ichthyology)
50202 Conservation and Biodiversity
medium
medium
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
data type > image data
problem type > future prediction
Biological Sciences
170203 Knowledge Representation and Machine Learning
60102 Bioinformatics
80109 Pattern Recognition and Data Mining
Evolutionary Biology
Paleoecology
small
featured
medium
ComputingMilieux_PERSONALCOMPUTING
health and fitness > self care > exercise > sports > association football
geography and places > europe
Brain
small
ComputingMilieux_PERSONALCOMPUTING
culture and arts > games and toys > video games
Biomarkers
110201 Cardiology (incl. Cardiovascular Diseases)
Ecology
Cancer
Science Policy
Pharmacology
Biotechnology
Plant Biology
Evolutionary Biology
19999 Mathematical Sciences not elsewhere classified
Marine Biology
Inorganic Chemistry
body size
microevolution
Selection
small
featured
featured
medium
society and social sciences > society > finance
philosophy and thinking > philosophy > history
medium
society and social sciences > society > crime
small
small
91299 Materials Engineering not elsewhere classified
30307 Theory and Design of Materials
30306 Synthesis of Materials
40604 Natural Hazards
Solid Earth Sciences
40407 Seismology and Seismic Exploration
Neuroscience
behavioral disciplines and activities
nervous system
genetic structures
psychological phenomena and processes
small
featured
society and social sciences > social sciences > economics
trade
balance of payments
exchange rates
interest rates
government expenditures
monetary reserves
international economics
government revenues
financial policy
featured
data type > image data
medium
problem type > multiclass classification
small
people and self > personal life > clothing
problem type > object identification
Cancer
Molecular Biology
Ecology
110309 Infectious Diseases
North America
Biochemistry
Mammalia
Microbiology
Computational Biology
Evolutionary Biology
Price equation
Species selection
Palaeocene
Eocene
Macroevolution
Body size
Palaeocene/Eocene boundary
Wyoming
Bighorn Basin
Clarks Fork Basin
small
Plant Biology
Computer Science::Computer Vision and Pattern Recognition
Computer Science::Neural and Evolutionary Computation
fungi
urologic and male genital diseases
Physics::Accelerator Physics
urogenital system
cardiovascular diseases
test datasets
female genital diseases and pregnancy complications
small
general reference > research tools and topics > books
society and social sciences > social sciences > linguistics
culture and arts > culture and humanities > languages
medium
featured
society and social sciences > social sciences > linguistics
ComputingMilieux_THECOMPUTINGPROFESSION
people and self > personal life > employment
small
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
geography and places > asia > india
society and social sciences > society > crime
small
80799 Library and Information Studies not elsewhere classified
small
110309 Infectious Diseases
60506 Virology
Medicine
Reptiles
Population Genetics - Empirical
Speciation
virology/viral replication and gene regulation
exon capture
Top End
Northern Australia
Carlia gracilis
111714 Mental Health
virology/viruses and cancer
Carlia amax
SNP
Virology
virology/persistence and latency
virology/effects of virus infection on host gene expression
Kimberley
snow cover maps
small
cardiovascular diseases
Paleoecology
120504 Land Use and Environmental Planning
Developmental Biology
Environmental Science
Cell Biology
small
culture and arts > games and toys
ComputingMilieux_PERSONALCOMPUTING
culture and arts > games and toys > video games
small
ComputerSystemsOrganization_PROCESSORARCHITECTURES
small
GeneralLiterature_MISCELLANEOUS
health and fitness > self care > exercise > sports > fishing
small
ComputerApplications_COMPUTERSINOTHERSYSTEMS
initiatives > socrata
110704 Cellular Immunology
110203 Respiratory Diseases
80699 Information Systems not elsewhere classified
59999 Environmental Sciences not elsewhere classified
69999 Biological Sciences not elsewhere classified
Cancer
Science Policy
110309 Infectious Diseases
Pharmacology
Developmental Biology
Plant Biology
Biochemistry
60506 Virology
Immunology
Genetics
Evolutionary Biology
19999 Mathematical Sciences not elsewhere classified
Molecular Biology
Agriculture
Insects
Quaternary
111714 Mental Health
Biotechnology
reaction norm
forensic
probe design
Brittany
frugivory
48°36´N
seed dispersal
Cytomegalovirus
serology
forest restoration
release Version 3.0
Western France
multicellularity
Community Ecology
Raw Data
tuberculosis
understory fires
Myxococcus xanthus
Aphididae
phenotypic plasticity
version
Aphidiinae
http
Brazil
Zone Atelier Armorique
natural regeneration
Drosophila genome
gene sequences
CVD
Epidemiology
Foodwebs
HCMV
Amazonia
physical forces in development
population genetics
Cardiovascular disease
1°32´W
DNA Barcoding
TB
non-CODIS STRs
txt
Tapirus terrestris
Agilent
Chinese Kyrgyz
Uganda
small
sediments disturbance porosity permeability
Uncategorised
Uncategorized
large
medium
Neuroscience
Imaging
medium
technology and applied sciences > computing > internet
EARTH SCIENCE > LAND SURFACE > TOPOGRAPHY > SURFACE ROUGHNESS
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > PRECIPITATION AMOUNT
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > VERTICAL WIND VELOCITY/SPEED
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > UPPER AIR TEMPERATURE
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > SURFACE TEMPERATURE > AIR TEMPERATURE
EARTH SCIENCE > ATMOSPHERE > ALTITUDE > GEOPOTENTIAL HEIGHT
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC PRESSURE > SEA LEVEL PRESSURE
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > SOLID PRECIPITATION > SNOW
EARTH SCIENCE > BIOSPHERE > VEGETATION > VEGETATION COVER
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > SURFACE WINDS
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > DEW POINT TEMPERATURE
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SURFACE WATER > SURFACE WATER PROCESSES/MEASUREMENTS > RUNOFF
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW/ICE TEMPERATURE
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW MELT
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW DEPTH
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW DENSITY
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > ALBEDO
EARTH SCIENCE > OCEANS > SEA ICE > ICE EXTENT
EARTH SCIENCE > OCEANS > OCEAN TEMPERATURE > SEA SURFACE TEMPERATURE
EARTH SCIENCE > LAND SURFACE > TOPOGRAPHY > TERRAIN ELEVATION
EARTH SCIENCE > LAND SURFACE > SURFACE THERMAL PROPERTIES > SKIN TEMPERATURE
EARTH SCIENCE > LAND SURFACE > SURFACE RADIATIVE PROPERTIES > ALBEDO
EARTH SCIENCE > LAND SURFACE > SOILS > SOIL TEMPERATURE
EARTH SCIENCE > LAND SURFACE > SOILS > SOIL MOISTURE/WATER CONTENT
EARTH SCIENCE > LAND SURFACE > SOILS > SOIL CLASSIFICATION
EARTH SCIENCE > CRYOSPHERE > SNOW/ICE > SNOW/ICE TEMPERATURE
EARTH SCIENCE > BIOSPHERE > VEGETATION > VEGETATION SPECIES
EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD PROPERTIES > CLOUD FREQUENCY
EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD MICROPHYSICS > CLOUD LIQUID WATER/ICE
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > WIND STRESS
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > VORTICITY
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > CONVERGENCE
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > CONVECTION
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > UPPER LEVEL WINDS
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR PROCESSES > EVAPORATION
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > WATER VAPOR
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > TOTAL PRECIPITABLE WATER
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > HUMIDITY
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > SHORTWAVE RADIATION
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > OUTGOING LONGWAVE RADIATION
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > LONGWAVE RADIATION
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > INCOMING SOLAR RADIATION
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > HEAT FLUX
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC PRESSURE > SURFACE PRESSURE
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC PRESSURE > GRAVITY WAVE
EARTH SCIENCE > ATMOSPHERE > ALTITUDE > PLANETARY BOUNDARY LAYER HEIGHT
EARTH SCIENCE > ATMOSPHERE > AIR QUALITY > TROPOSPHERIC OZONE
Molecular Biology
80699 Information Systems not elsewhere classified
69999 Biological Sciences not elsewhere classified
Science Policy
ComputingMethodologies_PATTERNRECOGNITION
Data_FILES
Data_MISCELLANEOUS
Iminium reactive intermediates
Abemaciclib
Side effects
Reactive metabolites
featured
small
society and social sciences > society > war
culture and arts > arts and entertainment > literature
people and self > people > social groups
small
fungi
urologic and male genital diseases
130306 Educational Technology and Computing
urogenital system
cardiovascular diseases
female genital diseases and pregnancy complications
69999 Biological Sciences not elsewhere classified
Pliocene
Pleistocene
Genetics
Evolutionary Biology
19999 Mathematical Sciences not elsewhere classified
integumentary system
parasitic diseases
amphibians
fungi
Southeastern U.S.
hybrid enrichment
Lithobates sphenocephalus
discordance
Hyla squirella
Hyla cinerea
mitogenome
Rana sphenocephala
frogs
Anaxyrus terrestris
barrier testing
phylogenomics
59999 Environmental Sciences not elsewhere classified
Ecology
69999 Biological Sciences not elsewhere classified
Cancer
Medicine
39999 Chemical Sciences not elsewhere classified
Sociology
sexual cycling
Amboseli Park
Kenya
Papio cynocephalus
1977-2014
steroid hormones
postpartum amenorrhea
gestation
P. anubis
body fat
80403 Data Structures
datasetsR
69999 Biological Sciences not elsewhere classified
Evolutionary Biology
Coral Triangle
Tropical shallow-marine biodiversity
Western Pacific
Indo-Australian Archipelago
Temporal diversity dynamics
Cenozoic
Latitudinal diversity gradients
Biodiversity hotspot
Ostracoda
Ecology
60205 Marine and Estuarine Ecology (incl. Marine Ichthyology)
20299 Atomic, Molecular, Nuclear, Particle and Plasma Physics not elsewhere classified
medium
GeneralLiterature_MISCELLANEOUS
ComputingMilieux_MISCELLANEOUS
people and self > personal life > hotels
GeneralLiterature_INTRODUCTORYANDSURVEY
InformationSystems_GENERAL
Temperature
DATE/TIME
ELEVATION
HEIGHT above ground
air
daily minimum
Method comment
LATITUDE
Description
daily mean
daily maximum
Station label
LONGITUDE
small
60102 Bioinformatics
Data_FILES
API
59999 Environmental Sciences not elsewhere classified
69999 Biological Sciences not elsewhere classified
Science Policy
data
study participants
membrane
experiment
assay
information
infectivity surveys
dataset
Cell Biology
29999 Physical Sciences not elsewhere classified
80699 Information Systems not elsewhere classified
speciation
111714 Mental Health
Stream Restoration
species delimitation
Fluvial Geomorphology
United States
Morphodynamics
Physical Geography
Dynastes
59999 Environmental Sciences not elsewhere classified
Ecology
Developmental Biology
Plant Biology
Immunology
Evolutionary Biology
Inorganic Chemistry
Hematology
Alps
Heterozygosity-fitness correlation
MHC
Capra ibex
bottleneck
Alpine ibex
Infectious kerato-conjunctivitis
medium
audience > beginner
Environmental Science
50204 Environmental Impact Assessment
50202 Conservation and Biodiversity
Forest
50209 Natural Resource Management
Food Security
Health
Impact evaluation
Dietary Diversity
15 Geothermal Energy
geothermal
Colorado
reconnaissance
shallow temperature survey
air photo lineaments
groundwater
geology
geologic map
geothermometry
map
Rico
Geodatabase
Dolores County
San Miguel County
Geochemistry
structural
point information
mines and prospects
travertine
land ownership
rico quadrangle
topographic
12 Built Environment and Design
air temperatures
Earth and Environmental Sciences
Heat waves
10 Technology
window position
Bedroom
Environmental Science
Genetics
Data_FILES
60405 Gene Expression (incl. Microarray and other genome-wide approaches)
60205 Marine and Estuarine Ecology (incl. Marine Ichthyology)
Biological Techniques
Zoology
featured
medium
ComputingMilieux_PERSONALCOMPUTING
health and fitness > self care > exercise > sports > association football
small
ComputingMilieux_COMPUTERSANDSOCIETY
ComputingMilieux_LEGALASPECTSOFCOMPUTING
small
medium
analysis > nlp
80106 Image Processing
Artificial Intelligence and Image Processing
80602 Computer-Human Interaction
80104 Computer Vision
80504 Ubiquitous Computing
featured
medium
Emotion Perception
Arabic Language
Speech Analysis
Molecular Biology
Cancer
Cell Biology
human activities
Neuroscience
Biochemistry
Genetics
Evolutionary Biology
parasitic diseases
distance decay
fungi
animal diseases
Quaternary
dispersal processes
the Hengduan Mountains
Scandentia
geographic isolation
Soricomorpha
reproductive and urinary physiology
species turnover
Erinaceomorpha
Lagomorpha and Rodentia
halving distance
69999 Biological Sciences not elsewhere classified
Medicine
Genetics
80699 Information Systems not elsewhere classified
19999 Mathematical Sciences not elsewhere classified
Speciation
Geography of Speciation
Phylogenetic Comparative Methods
Global
Approximate Bayesian Computation
medium
GeneralLiterature_MISCELLANEOUS
ComputingMethodologies_PATTERNRECOGNITION
data type > text data
Cancer
80106 Image Processing
Fourier Optics
Image and Signal Processing
Computer Engineering
60408 Genomics
humanities
kinship
behavior and behavior mechanisms
single nucleotide polymorphism
social sciences
sampling
medium
geography and places > asia > india
technology and applied sciences > transport > vehicles
mathematics and logic > mathematics > numbers
Uncategorised
Uncategorized
80699 Information Systems not elsewhere classified
59999 Environmental Sciences not elsewhere classified
69999 Biological Sciences not elsewhere classified
Science Policy
60506 Virology
Immunology
Evolutionary Biology
phylogenomics
Caenorhabditis
Strigamia
Oikopleura
Lottia
Ixodes
2 billion years
Tetranychus
Hydra
Capitella
Ciona
Danaus
Acyrthosiphon
Gasterosteus
Monosiga
phylogenetic conflict
Saccoglossus
Apis
Branchiostoma
Gallus
Anolis
Trichoplax
Brugia
Pinctada
Rhodnius
Bombus
Fugu
Tribolium
Mnemiopsis
Amphimedon
Homo
Daphnia
Strongylocentrotus
Xenopus
Latimeria
Nematostella
locus selection
Salpingoeca
long-branch attraction
Drosophila
Earth
Acropora
small
small
Ecology
50301 Carbon Sequestration Science
Climate Science
40104 Climate Change Processes
54 Environmental Sciences
ngee
ngee-arctic
barrow
alaska
soil characteristics
elements
organic carbon
organic matter
Molecular Biology
59999 Environmental Sciences not elsewhere classified
Ecology
69999 Biological Sciences not elsewhere classified
Neuroscience
Genetics
Evolutionary Biology
Dataset
369 individuals
formatted
Label
sample
Microsatellite genotype data Genotype information
GenAlEx
location
column headers
.6.502
diploid loci
small
ComputingMethodologies_PATTERNRECOGNITION
Data_MISCELLANEOUS
Ecology
Environmental Science
Soil Science
49999 Earth Sciences not elsewhere classified
50102 Ecosystem Function
Limnology
small
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
medium
featured
ComputerApplications_COMPUTERSINOTHERSYSTEMS
society and social sciences > social sciences > linguistics
general reference > research tools and topics > writing
small
small
ComputerApplications_COMPUTERSINOTHERSYSTEMS
ComputingMilieux_MISCELLANEOUS
40101 Atmospheric Aerosols
69999 Biological Sciences not elsewhere classified
Biochemistry
Microbiology
analysis
universality
Dataset II
primer
Conversion and Storage Engineering
90607 Power and Energy Systems Engineering (excl. Renewable Power)
91305 Energy Generation
medium
small
featured
geography and places > north america > united states
society and social sciences > society > crime > violence
society and social sciences > society > crime
society and social sciences > society > crime > terrorism
small
featured
ComputingMilieux_PERSONALCOMPUTING
culture and arts > games and toys > video games
medium
skin and connective tissue diseases
Uniform resource locator/link to file
Comment
Longitude of event
Event label
Latitude of event
Station label
Elevation of event
Comment of event
Baseline Surface Radiation Network (BSRN)
WCRP/GEWEX
hyperspectral
VNIR
Tea
MODLIFE
File name
File size
Uniform resource locator/link to file
File format
Antarctica
Functional Ecology @ AWI (AWI_FuncEco)
Development of a CCAMLR Marine Protected Area in the Antarctic Weddell Sea (WSMPA)
File content
Weddell Sea
Marine Protected Area (MPA)
Porifera
Echinodermata
Database
outer membrane
Halanaerobiales
Source Code
Negativicutes
evolution
Firmicutes
small
problem type > multiclass classification
machine learning > classification
large
technology and applied sciences > computing > human-computer interaction
small
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
society and social sciences > society > business
society and social sciences > society > finance
geography and places > north america > united states
ComputingMilieux_MISCELLANEOUS
society and social sciences > society > organizations
technology and applied sciences > computing > companies
small
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
GeneralLiterature_MISCELLANEOUS
health and fitness > self care > exercise > sports
Data_GENERAL
health and fitness > self care > exercise > running
small
technology and applied sciences > computing > internet
society and social sciences > society > business
ComputingMilieux_MISCELLANEOUS
analysis > data visualization
analysis > data cleaning
mathematics and logic > statistics > categorical data
machine learning > classification
analysis > image processing
featured
data type > image data
medium
machine learning > deep learning
mathematics and logic > statistics > categorical data
Microbiology
110307 Gastroenterology and Hepatology
110303 Clinical Microbiology
ComputingMilieux_COMPUTERSANDEDUCATION
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
ComputingMilieux_LEGALASPECTSOFCOMPUTING
ComputingMethodologies_ARTIFICIALINTELLIGENCE
130306 Educational Technology and Computing
GeneralLiterature_INTRODUCTORYANDSURVEY
Education
130103 Higher Education
Neuroscience
Imaging
Temperature
DATE/TIME
air
coastal climate
Mediterranean
weather conditions
Humidity
relative
Wind speed
Wind direction description
gust
Pressure
atmospheric
Precipitation
Thermometer
Hygrometer
Anemometer
Barometer
Pluviometer
69999 Biological Sciences not elsewhere classified
110309 Infectious Diseases
Developmental Biology
Immunology
Computational Biology
Genetics
Molecular Biology
small
medium
60702 Plant Cell and Molecular Biology
food and beverages
natural sciences
60703 Plant Developmental and Reproductive Biology
Ecology
69999 Biological Sciences not elsewhere classified
Biochemistry
Computational Biology
ComputingMethodologies_PATTERNRECOGNITION
Data_MISCELLANEOUS
cortical entrainment
semantic processing
EEG
cocktail party
selective attention
natural speech
multisensory integration
70204 Animal Nutrition
70108 Sustainable Agricultural Development
70501 Agroforestry
Proximate Composition
fatty acids
70199 Agriculture, Land and Farm Management not elsewhere classified
70107 Farming Systems Research
Phenolic Compounds
Digestibility
chemical composition profiles
Leaf traits
Neuroscience
Imaging
featured
small
society and social sciences > social sciences > linguistics
people and self > self > gender
medium
InformationSystems_INFORMATIONINTERFACESANDPRESENTATION(e.g., HCI)
ComputingMethodologies_PATTERNRECOGNITION
59999 Environmental Sciences not elsewhere classified
69999 Biological Sciences not elsewhere classified
Science Policy
Cell Biology
Microbiology
Inorganic Chemistry
Sociology
111714 Mental Health
Amboseli Park
Kenya
Papio cynocephalus
1977-2014
steroid hormones
postpartum amenorrhea
gestation
P. anubis
body fat
sexual cycling
ComputingMilieux_THECOMPUTINGPROFESSION
Computer Science
Social Web
small
ComputingMilieux_COMPUTERSANDEDUCATION
medium
medium
featured
culture and arts > games and toys
ComputingMilieux_PERSONALCOMPUTING
culture and arts > games and toys > video games
EARTH SCIENCE > Atmosphere > Atmospheric Pressure > Surface Pressure
EARTH SCIENCE > Atmosphere > Altitude > Geopotential Height
EARTH SCIENCE > Land Surface > Land Temperature > Skin Temperature
EARTH SCIENCE > Oceans > Ocean Temperature > Sea Surface Temperature
EARTH SCIENCE > Oceans > Sea Ice > Sea Ice Concentration
EARTH SCIENCE > Hydrosphere > Snow/Ice > Snow Water Equivalent
EARTH SCIENCE > Land Surface > Soils > Soil Temperature
EARTH SCIENCE > Land Surface > Soils > Soil Moisture/Water Content
EARTH SCIENCE > Atmosphere > Atmospheric Pressure > Sea Level Pressure
EARTH SCIENCE > Atmosphere > Atmospheric Winds > Upper Level Winds
EARTH SCIENCE > Atmosphere > Atmospheric Winds > Surface Winds
EARTH SCIENCE > Atmosphere > Atmospheric Winds > Boundary Layer Winds
EARTH SCIENCE > Atmosphere > Atmospheric Temperature > Air Temperature
EARTH SCIENCE > Atmosphere > Atmospheric Water Vapor > Humidity
featured
medium
ComputingMilieux_COMPUTERSANDEDUCATION
GeneralLiterature_MISCELLANEOUS
InformationSystems_INFORMATIONSYSTEMSAPPLICATIONS
society and social sciences > society > finance
small
natural and physical sciences > biology
machine learning > classification
problem type > regression
small
people and self > personal life > housing
problem type > future prediction
small
ComputingMilieux_COMPUTERSANDEDUCATION
ComputingMilieux_LEGALASPECTSOFCOMPUTING
ComputingMilieux_THECOMPUTINGPROFESSION
ComputingMethodologies_PATTERNRECOGNITION
Data_MISCELLANEOUS
Physics::Accelerator Physics
ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS
Computer Science::Performance
APS
Citation data
Physics::Fluid Dynamics
Computer Science::Networking and Internet Architecture
Applied Physics
MathematicsofComputing_DISCRETEMATHEMATICS
Computer Software
80109 Pattern Recognition and Data Mining
80403 Data Structures
80301 Bioinformatics Software
69999 Biological Sciences not elsewhere classified
Inorganic Chemistry
Plant Biology
Biochemistry
Genetics
Evolutionary Biology
80699 Information Systems not elsewhere classified
Phylogenomics
Plastid genome
Trebouxiophyceae
Chlorophyceae
Ulvophyceae
Prasinophyceae
Chlorellales
Chlorophyta
Pedinophyceae
Cancer
otorhinolaryngologic diseases
89999 Information and Computing Sciences not elsewhere classified
80612 Interorganisational Information Systems and Web Services
Library and Information Studies
80799 Library and Information Studies not elsewhere classified
80706 Librarianship
Cancer
160699 Political Science not elsewhere classified
Immunology
Data_MISCELLANEOUS
ComputingMethodologies_PATTERNRECOGNITION
Uncategorised
Uncategorized
Ecology
69999 Biological Sciences not elsewhere classified
Science Policy
Neuroscience
Genetics
cryptic female choice
ovarian fluid
genetic heterozygosity
sperm competition
embryo survival
small
ComputingMilieux_LEGALASPECTSOFCOMPUTING
ComputingMilieux_THECOMPUTINGPROFESSION
ComputingMilieux_GENERAL
large
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
culture and arts > visual arts > comics
culture and arts > visual arts > animation
small
Cell Biology
Developmental Biology
RNAi
muscle
screen
Drosophila melanogaster
small
medium
InformationSystems_MISCELLANEOUS
medium
69999 Biological Sciences not elsewhere classified
Science Policy
Inorganic Chemistry
Neuroscience
North America
ComputingMethodologies_PATTERNRECOGNITION
80699 Information Systems not elsewhere classified
Europe
Anthropocene
taxonomy
Leptogium
biogeography
Leptogium saturninum
lichen
revision
Statistics::Computation
macrolichen
cyanolichen
Mallotium
Collemataceae
UNFCCC
emissions data
small
small
featured
ComputingMilieux_PERSONALCOMPUTING
culture and arts > games and toys > board games
small
large
small
featured
GeneralLiterature_MISCELLANEOUS
ComputingMilieux_COMPUTERSANDSOCIETY
ComputingMilieux_THECOMPUTINGPROFESSION
society and social sciences > social sciences > economics
medium
featured
GeneralLiterature_REFERENCE(e.g., dictionaries, encyclopedias, glossaries)
ComputingMilieux_MISCELLANEOUS
culture and arts > performing arts > film
culture and arts > visual arts
non-spherical sand particle
CARES
organic aerosol
40601 Geomorphology and Regolith and Landscape Evolution
rotation
CalNex
Atmospheric Sciences
volatility
40302 Extraterrestrial Geology
drag forces
Geology
40607 Surface Processes
particulate matter
CMAQ
Aging
SOAS
Diseases
small
featured
natural and physical sciences > physical sciences > physics
ORDINAL NUMBER
Symbol
Melting point
Boiling point
natural and physical sciences > physical sciences > chemistry
Name
Atomic weight
medium
analysis > nlp
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
society and social sciences > social sciences > linguistics
analysis > text mining
TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES
medium
ComputerApplications_COMPUTERSINOTHERSYSTEMS
problem type > customer value
medium
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
society and social sciences > social sciences > linguistics
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL
culture and arts > culture and humanities > languages
ComputingMethodologies_ARTIFICIALINTELLIGENCE
human activities
90399 Biomedical Engineering not elsewhere classified
parasitic diseases
health services administration
population characteristics
endocrine system diseases
Cancer
Pharmacology
Inorganic Chemistry
19999 Mathematical Sciences not elsewhere classified
Hybridization
Hematology
110309 Infectious Diseases
RADseq
Quercus
EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW DEPTH
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > PRECIPITATION AMOUNT
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > SURFACE TEMPERATURE > AIR TEMPERATURE
EARTH SCIENCE > LAND SURFACE > SOILS > SOIL MOISTURE/WATER CONTENT
EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR PROCESSES > EVAPORATION
medium
ComputingMilieux_MISCELLANEOUS
initiatives > socrata
59999 Environmental Sciences not elsewhere classified
Ecology
69999 Biological Sciences not elsewhere classified
Science Policy
Immunology
Temperate forest
Soil biota
Population abundance
Boreal forest
Soil respiration
Metabolic rate
Temperate grassland
Tundra
Soil community
Temperature sensitivity
Biome
Individual mass
Tropical forest
medium
Financial
Uncategorized
Computer Science
small
Medicine
Data_MISCELLANEOUS
ComputingMethodologies_PATTERNRECOGNITION
small
ComputingMilieux_PERSONALCOMPUTING
featured
natural and physical sciences > nature > plants
society and social sciences > social sciences > international relations
small
natural and physical sciences > nature > animals
natural and physical sciences > nature > environment
small
medium
170299 Cognitive Science not elsewhere classified
10401 Applied Statistics
Paleoecology
small
medium
featured
natural and physical sciences > biology
technology and applied sciences > medicine
education
featured
large
culture and arts > performing arts > film
technology and applied sciences > computing > artificial intelligence
featured
natural and physical sciences > biology
large
technology and applied sciences > medicine
data type > image data
featured
medium
ComputingMethodologies_PATTERNRECOGNITION
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
general reference > research tools and topics > writing
mathematics and logic > mathematics
public domain
peptidase
dictionary of the English language
authority control
burial
grave
cemetery
genealogy
medicine
molecular function
gene product
biological process
gene
cellular component
Membrane transport protein
botany
daylight saving time
IANA time zone
zoology
manga
anime
open access in France
architecture
video game
video game
video game
genealogy
genealogy
art
design
photography
astronomy
genealogy
Earth sciences
Canadians
baseball
board game
medicine
biology
heavy metal
shark attack
food
charitable organization
genealogy
medicine
lyrics
triangle center
celebrity
human anatomy
art
rare disease
geographic location
place name
biology
biodiversity informatics
score
astronomical catalog
protected area
geography
history
ornithology
chemistry
death
chemistry
UNIDROIT
private international law
United Nations Convention on Contracts for the International Sale of Goods
Jewish people
mixed martial arts
ZX Spectrum
chemical reaction
enzymes
nomenclature
classification system
enzyme activity
chemistry
zoology
environmental protection
maritime transport
genealogy
invasive species
open access in France
food
food labeling regulations
recycling codes
nutrition facts label
Saccharomyces cerevisiae
biology
Arabidopsis thaliana
film
television series
Caenorhabditis elegans
chemistry
CAS Registry Number
manga
anime
manhua
manhwa
Korean animation
donghua
algorithmics
algorithm
data structure
archaeology
Australian literature
biographical article
United States federal judge
bitterness
nursing
plant virus
death
Drosophila melanogaster
extrasolar planet
spectroscopy
human genome
protein
comics
telephony
open educational resource
protein
two-component regulatory system
Punjab
heritage
Punjabi culture
Sumerian
open access policy
open-access repository
open access
open access in Portugal
open access in Latin America
open-access journal
open access in Uruguay
open access in Argentina
open access in Chile
open access in Spain
open access in Mexico
open access in Brazil
open access in Peru
aging
open access in Latin America
open-access journal
open access in Brazil
geochemistry
taxation
gene product
biological network
metabolite
biological pathway
covered bridge
citizen science
bird
chemistry
Theatre of Poland
comics
gene
disease
sign language
Jewish studies
bibliography
Czech
track and field
long non-coding RNA
mineral
member of the French National Assembly
member of parliament
medicine
health care
listed building in the United Kingdom
data set
Welsh newspapers
biographical article
musician
music video
script typeface
Arabic numeral
transcription factor
Y chromosome haplotype
protein
Michaelis constant
drug
history of medicine
Christian hymn
media of Australia
history of Australia
OpenStreetMap
Australian rules football
Australian Football League
phenotype
genotype
racing automobile driver
auto racing
Poaceae
genealogy
hazardous substances
industrial safety
personal protective equipment
anime
Gymnospermae
nobility
academic genealogy
given name
parent
date of birth
Social Security number
Member of the Victorian Legislative Assembly
Member of the Victorian Legislative Council
open access monograph
history
basketball
genealogy
Greek mythology
cultural landscape
monument
shell corporation
tax noncompliance
mass spectrum
chemistry
Crocodile attack
Russian
cell line
biotechnology
cell biology
media studies
communication studies
film studies
cell biology
Schizosaccharomyces pombe
genetics
bookselling
philology
state school
school district
state education agency
Lord Byron
Lucas Cranach the Elder
botany
nomenclature
classification system
open data
public art
earthquake
seismology
database
spectroscopy
new media art
installation art
educational institution
research institute
medicine
library history
death
censorship
cultural heritage
botany
Danio rerio
member of the Parliament of Finland
Auvergne
library
museum
archives
airline
Wikipedia
wiki
biography
sentiment analysis
review
sentiment analysis
Internet Movie Database
sentiment analysis
film criticism
sentiment analysis
speech segmentation
chat room
Extracellular RNA
shell corporation
tax noncompliance
King Arthur
ontology
Semantic Web
Semantic MediaWiki
semantic similarity
open access in France
open access policy
open access in Portugal
open access in Spain
open access in the United States of America
open access in Norway
open access in the United Kingdom
open access in Japan
open access in Switzerland
open access in Australia
open access in Austria
open access in New Zealand
open access in Sweden
open access in Hungary
open access in India
open access in Luxembourg
open access in Germany
open access in China
open access in Belgium
open access in the Netherlands
open access in Finland
open access in Italy
open access in Ireland
open access in Denmark
open access in Canada
open access in South Africa
altered state of consciousness
macromolecular complex
death
North Carolina
violence
plant
slang
open educational resource
cancer
Majorana fermion
physics
nanotechnology
biology
number
open-access repository
open access
open data
cultural heritage
metadata
GLAM
Semantic processing
data quality
provenance
Taiwan
linked data
LODLAM
Comprehensive Knowledge Archive Network
Linked Open Data
Maltese
soil
immunology
coral reef
academic publishing
Disappeared indigenous women
violence against women
artist
medication
Baxter
robotics
artificial intelligence
medieval studies
women's studies
medieval studies
women's studies
Litchfield Law School
Litchfield Female Academy
women writers
theater
biography
children's writer
Coccoidea
orphan work
magic lantern
National Museum of Finland
Lahti City Museum
Kalevala
book
botany
business record
Comédie-Française
gene
rare disease
genome
genetic disease
DNA
cluster of differentiation
orphanage
residential child care community
pathogen
Ukrainian studies
scholarly communication
open-access publisher
library publishing
digital humanities project
gender studies
kidney
urinary bladder
citation
bibliographic metadata
author disambiguation
Aramaic
microorganism
protein kinase
computer security
symptom
London
genealogy
systematic review
history of books
Proboscidea
vascular plant
cultural heritage
theater
gender
sexual orientation
sex
health equity
dedup_wf_001::2d13dd919b0ec4519c4a0967c4c7cd47
dedup_wf_001::28e209b61a52482a0ae1cb9f5959c792
datacite____::b28d97a3796c731d7942540a524e838c
datacite____::99ab1f72bc7dd695c7a7c3cbe61a71c9
dedup_wf_001::9a3b12eae9d47e02ef64d44e7d810b45
dedup_wf_001::91e7741fa646208b4957a0787bdff276
datacite____::f9fbd711271a819edbc9fc9f4f40d649
dedup_wf_001::fd2e1d7f642d76c31b4bcaad920964f2
dedup_wf_001::9405e729b5f6f1170f1e8c9fae04449c
dedup_wf_001::6e1064560bceeef0fb808280313b5ca0
dedup_wf_001::565b4bb4c813ca7af0852174ce8036f4
r38d07aef7b7::039c7319b67bb87c9b7f62111caf65d1
r38d07aef7b7::17c276c8e723eb46aef576537e9d56d0
r38d07aef7b7::26da9d37357b01ee4fe35ce3fc969b1e
r38d07aef7b7::a61eded670f6e29acff242cae3b82a96
datacite____::d79fd236a9a91905c5bb199526073143
dedup_wf_001::daff15865daf824544b3a939d1f2bdc3
dedup_wf_001::0c7d40cfc50b5509570a6ebe7162c94c
datacite____::01bf8e9bb3c67b1432fc474bb0a3dc80
dedup_wf_001::0eb8d0c496ea727752cafc3c607d3072
dedup_wf_001::fa8c1cb0271969daab5d9a0f0c2592e2
r38d07aef7b7::4457906f472a0a4e966a17de3054a8bc
dedup_wf_001::b77b84aec681a41428579a44347402d3
datacite____::ba2372c047e9151f5566dc17c768e13f
dedup_wf_001::874be2e7cb10e5a88fb39039785ce274
dedup_wf_001::07db729092a1bf924ac83f935a954255
datacite____::0c5738984e63350071535fd8e73b35a5
r38d07aef7b7::124461dcd3571e6674ec4e0e140cc298
r38d07aef7b7::fd9f2aa91ceacfb305f86f2f76bfd494
dedup_wf_001::2f4cd0a689df7a6613b9ff4e84b34df6
dedup_wf_001::3c74f6d5c43355752f342f3e30bddf86
datacite____::152a0eb3563671c2ec7b2e6b84bd6d0b
dedup_wf_001::dd88b51c7ec34b0e50f7c49ed4164bb6
r38d07aef7b7::4ccc3735e387537e61269a976a33e412
dedup_wf_001::80cea7c26dcb03eb4b39e61b1effd8d1
datacite____::5b7018ad8caa4e45bac84223d93897c7
datacite____::66734f6fbfece8fe1dcdf2515628be52
r38d07aef7b7::3bcad4e7af821b33b29f7078b90ab75a
r38d07aef7b7::bddcda5d65fcfdec9de3838794a77265
dedup_wf_001::6f4922f45568161a8cdf4ad2299f6d23
dedup_wf_001::1a7f5cee6c09c0031cd5783d79740e14
dedup_wf_001::de50dd7c2f79237408a53ae6086551e2
dedup_wf_001::b6a05becd449b2d9d9d95010954c9308
r38d07aef7b7::133b5f08ade8b354bfd42b98c629ef05
r38d07aef7b7::13d429db192fbc7b5cabf9b936cf78e1
dedup_wf_001::7387f1ed39c0198734cd774f398e4398
dedup_wf_001::29c117378bda70200aa09a0baae05afe
datacite____::f4a9cdb6299031f87b702d50e93431d0
dedup_wf_001::261732af0fe337df28be726375463dce
datacite____::6093c3a026868d7dbcb976b2900976fc
r38d07aef7b7::abd815286ba1007abfbb8415b83ae2cf
r38d07aef7b7::e261489ab942429a6600c1c4121ac14d
dedup_wf_001::174f8f613332b27e9e8a5138adb7e920
datacite____::dcb90243f3119e47795e6dde40a1c44f
dedup_wf_001::44b0e8fa282c03644a16def023c48cdd
dedup_wf_001::462b7359bda3d8ed2873c091c2f3b367
dedup_wf_001::3a078109c16feb217c4a4b697d044990
datacite____::14063ac79d0d4dccffb95930463da296
datacite____::560451e078bd4a00adf84fcbf2d475b1
datacite____::379350260e640997e4a79c00b37d69aa
dedup_wf_001::53c16d65d012198a587f8745bad50014
dedup_wf_001::c0ff1c505fd116e5a8464fc4068554f3
datacite____::5af31a4dc554de0fd5bcf320a33e1494
datacite____::005f157c6f2ebd438eac5a55540457fb
dedup_wf_001::962e0272f808572f42c896e12720f625
r38d07aef7b7::1f10c3650a3aa5912dccc5789fd515e8
r38d07aef7b7::be1bc7997695495f756312886f566110
dedup_wf_001::8cde4f55f710f9f07236662ad05f7f05
datacite____::62cb9ba8515a239a73f5f21f98e10a0a
datacite____::0070899a302319800e37c660b21fe1e2
datacite____::13afe469cfa9d9840d9fb145ff1cd702
dedup_wf_001::18997733ec258a9fcaf239cc55d53363
r38d07aef7b7::754da7dc2ed681cb2084a83124fc63cf
datacite____::d9f9300f8c65d87bcadc4f061fe67a73
dedup_wf_001::9ec1e0f696fb8327e37674f9b67fec35
r38d07aef7b7::f1b8b7b3ceb65c188dcdc0851634cadf
dedup_wf_001::503a356966a9db3f68c9ca050d3d77fb
dedup_wf_001::413827d57b940b4b9f0d23012330d573
dedup_wf_001::4423231251806e094b61c5afeba7a535
dedup_wf_001::60bc551ca678c042256508c5a0f46689
dedup_wf_001::4f9e8bd6c0b2752cc4eb8115ee61c923
dedup_wf_001::c4b7a093d0c8baf772bf67cedc999d2c
r38d07aef7b7::4b4c6c207e1e59c5af70b3b4c7b46c5a
r38d07aef7b7::d542599794c1cf067d90638b5d3911f3
dedup_wf_001::d5bc8b494bd1879cf590498995206e14
dedup_wf_001::dcdeb3bdb79cff4f5225298409e438c3
dedup_wf_001::f88b71913a966f761ded194e27330ead
dedup_wf_001::17759a641175195d47a241f627dd7003
dedup_wf_001::1055f539f5e622ab14d46487c3daf73f
dedup_wf_001::4bd4226f0f3144fdc5647d65e5a8d873
datacite____::858ac0b14c81cdc8db1005a67ca816a0
dedup_wf_001::1d082b72d1f60e0582fc0ffe412aaac4
dedup_wf_001::365aa6ebdc3dbf28e7b9ea1c1b4d2908
r38d07aef7b7::5a90c7cf26f2109e4db8466c251911be
r38d07aef7b7::e5c6f944080958c264936693c43f8aaa
datacite____::4848ada5d4aae379ae89924371316479
datacite____::191647d9e0a75d7d0c797541df62e300
datacite____::55eecfe9e90ff06bcf1245658a4aed77
dedup_wf_001::f09e534a6bfc6b05de696f1fe27634c8
datacite____::694b608a1bbf54d3bf03f33478f62f0a
datacite____::3426c36e6b0668ea520b848862aaa343
dedup_wf_001::446ac7480f6cd015d176f8b3d28a03b5
dedup_wf_001::544791a2847e5e9324cc4747a27f7237
dedup_wf_001::3e161c33e87de157ab48186b6420e768
datacite____::798018b75283b200ce7052d73be3b7b5
dedup_wf_001::7f2dc9ca702c66e1dd36b63fdd0d2dae
dedup_wf_001::b65825e7b2c0e9d8e051d4a3b97ef088
dedup_wf_001::6fc1e19f936b4766aaf858f978dd9b0c
dedup_wf_001::8637141cb688de20443ba785b28e3ab0
r38d07aef7b7::243f6a5292350cc163601aac9ad3e854
r38d07aef7b7::52dbb0686f8bd0c0c757acf716e28ec0
r38d07aef7b7::8c41eebf5a1f5867cbe38cf59b37c1bf
datacite____::66fcb8d2a85690d69b9e29571a09362e
dedup_wf_001::1c383cd30b7c298ab50293adfecb7b18
dedup_wf_001::1707acd183aaa7bc989a6ac92fabb2c8
r38d07aef7b7::8db1625bead0f643f7f7913edc2a8434
datacite____::fbdbc51c1cae07b3f75f086256686c7d
datacite____::1f36f259997b5795202a0bebd2292007
dedup_wf_001::66dd89ef074e7ac9d4c1de6775991b0c
dedup_wf_001::6e0e24295e8a86282cb559b860416812
r38d07aef7b7::aebf7782a3d445f43cf30ee2c0d84dee
dedup_wf_001::9264177717e350795dc4687789512c34
dedup_wf_001::95375ffaf5a308b340e5eba805715568
r38d07aef7b7::07b2ee9f02d5e6e8894377afb4feed32
r38d07aef7b7::186a157b2992e7daed3677ce8e9fe40f
r38d07aef7b7::a431d70133ef6cf688bc4f6093922b48
r38d07aef7b7::ec0f40c389aeef789ce03eb814facc6c
datacite____::9ea315721d9089632922f8ca28c25849
dedup_wf_001::6cdb2c0acda55360ac8e3e33fc39bbd3
dedup_wf_001::b9e1be0891c6a16e9644a57b798ac8a0
r38d07aef7b7::41d626e181cd445e3cac18440a448424
r38d07aef7b7::5607fe8879e4fd269e88387e8cb30b7e
r38d07aef7b7::8f53295a73878494e9bc8dd6c3c7104f
dedup_wf_001::a368f8f84bce73d071a34722eb55f03f
datacite____::95843e4e22c345e8ee61f7ba834c70b3
dedup_wf_001::0181dbcc3606f670bbe50f984967f358
dedup_wf_001::063b7d7ae9cd5ea74e1f879c52a91917
r38d07aef7b7::298923c8190045e91288b430794814c4
r38d07aef7b7::987b75e2727ae55289abd70d3f5864e6
dedup_wf_001::03c65c6be9c8b37f09759c662325f152
r38d07aef7b7::e58cc5ca94270acaceed13bc82dfedf7
dedup_wf_001::20aee3a5f4643755a79ee5f6a73050ac
dedup_wf_001::47cf1cacc977063fd3ab8c1681d344c4
dedup_wf_001::f39650b66c31ee5d8d33a7ac2b5977ad
dedup_wf_001::338bda4610126bf5b01eb64f01c39b5e
dedup_wf_001::44ca77772ddd0d2200fd5e95cfb37ae2
dedup_wf_001::4e38a56a966fd2c2b4fc978ce57d20ee
r38d07aef7b7::cda72177eba360ff16b7f836e2754370
dedup_wf_001::04cf31999d95c51dc6b3eb0770c9b520
dedup_wf_001::cc2ec0e9790ea02622e0c9bea8822804
dedup_wf_001::09eeef09f2210c5176693da3b918d36c
r38d07aef7b7::1cbaa4e5609fb6517f54f0ab0c205ada
r38d07aef7b7::536a76f94cf7535158f66cfbd4b113b6
r38d07aef7b7::9430142689f1e3004253e1d85c9aef57
r38d07aef7b7::e84401ad27c4cfb9815776eb9432ff17
dedup_wf_001::c9a37ed9f5261e7b116c7cf0065c0794
dedup_wf_001::ba9ff9ccf2c1a2b138a20c5c2fc6501a
datacite____::eaf73b8884b8bfec62dcb523b454e2be
datacite____::dc875eac206cbf1660d30888f29db383
datacite____::c0acc1df4fad82cc743e2cbcac528e9a
r38d07aef7b7::d79b5c2f0375f87503706a142964d7d5
dedup_wf_001::6e2715c4ee2dd9a6eabb9279d6684699
dedup_wf_001::5f5c048868bce9c55853e587a4ced9d0
r38d07aef7b7::531db99cb00833bcd414459069dc7387
dedup_wf_001::1643511b5df8dc9bb2bc4b5d712370ca
datacite____::3b28b73795eb7af88aba69cbe8314005
dedup_wf_001::264f2e01237058d1ae12f4f56ced8347
dedup_wf_001::96feb27374dde404eec29783f9c5b504
dedup_wf_001::0829cab14fd3f2444652a9cf2b779732
datacite____::f00a6893d85d8f44f6f4eec2dac8d4f7
dedup_wf_001::5d5f388fcd32bfec29ea54c9fdb1e578
datacite____::2e76c94162df19df9b7db19bd211904b
dedup_wf_001::5d02327b6b75f0167a87557b655ba440
dedup_wf_001::34fde688ef533f60a90e13e3238fc23e
dedup_wf_001::26408ffa703a72e8ac0117e74ad46f33
dedup_wf_001::197de90defc3b543859d0c1ad3c2c77e
r38d07aef7b7::14ea0d5b0cf49525d1866cb1e95ada5d
r38d07aef7b7::361440528766bbaaaa1901845cf4152b
r38d07aef7b7::cceb1161867ab91def7fac026ead455c
dedup_wf_001::db8361c8ead35eb0376d50cba21777e4
r38d07aef7b7::78ccad7da4c2fc2646d1848e965794c5
r38d07aef7b7::0f9ef8cb70bb4135133a24a464ad55e1
r38d07aef7b7::137ffea9336f8b47a66439fc34e981ee
r38d07aef7b7::cfd66e741860718ddecf1f6eabd05fc6
datacite____::949f6a3dacb58c49d5efe348d9a07f3e
datacite____::bdaf5bbdff5355a78584bb805686c2f7
dedup_wf_001::1d2aee2a6f5c58a1561041f50cb27981
r38d07aef7b7::9af76329c78e28c977ab1bcd1c3fe9b8
dedup_wf_001::56207854195a80308778147e1f7c7728
r38d07aef7b7::50177f8a9ab8866cb77c77ae1e47c5fa
dedup_wf_001::ad38ae802c1dba5c60a3ea8e4f5b1e08
dedup_wf_001::74929e31e2071052d67719bd92af6aba
datacite____::27cf64b29ad0dacffe8d397c221224f6
datacite____::aa970012fe3b849f5afe19c372ecf2e6
dedup_wf_001::29d6d05568a6fa882f3ff0b5c9ece960
dedup_wf_001::ee6f4043bba9021436c19cea29eb9b08
dedup_wf_001::3f845e7b5d2ed2f3ff23ff4f96da779b
dedup_wf_001::417229acc3bc53ab3c260f42a5780caa
dedup_wf_001::9db4f120e04f921a331a70f1e5d41ef3
datacite____::5cf7bbf9c1c71c410649210e79685402
dedup_wf_001::3d19b555f4a3f063b0cf7660af9fbf79
dedup_wf_001::27213d4c9e43e44257a1bafb26d1cfa5
dedup_wf_001::f874adbde59aa80586975c7e15fd5232
dedup_wf_001::88df1d18a85e41643042895f6c9c336a
datacite____::d63344ce48568515fb7b9c64efdaf68e
r38d07aef7b7::4e4dfebee38dd25062b6888505bcca50
r38d07aef7b7::6547884cea64550284728eb26b0947ef
r38d07aef7b7::6de4bfe9504589a457d6e92fae4f9613
datacite____::05b5e8e5bc3b197e0719c06104414f2d
dedup_wf_001::8fbbac76293ae93d906ca0f49ab85b48
dedup_wf_001::543a84894716c6c0ed6cabe60aac9945
dedup_wf_001::430108f92cc52a0e62cdec1b8df60297
dedup_wf_001::3b944683a96ec43a374835f2f2691d81
datacite____::8e8d6f148d9e7b46c2b8af9d47546a08
dedup_wf_001::cc086bcd836b5672ca48377fb58139f6
dedup_wf_001::3ee4f8cffd4de02743d2dae81ed6e82b
datacite____::66a1c3eeeac3ddaca069542a9800a106
r38d07aef7b7::a60b48c9d56949d618129c45511b5cad
dedup_wf_001::02e2259af55fb6a382ca7aa9f13319b3
dedup_wf_001::37967876fe062fb38ae244af3f793d71
dedup_wf_001::9675f0018d103531e073a3da2945df41
dedup_wf_001::983a6e7a05d3fb4d27f85e86ef2a71c0
dedup_wf_001::ffcc0381a555d393b99f4efeecbe186f
dedup_wf_001::e49c24db85af66780b897252c6a01866
dedup_wf_001::3d684d83ae41e66000194029e38a36d8
dedup_wf_001::6c2be392910c44524d26880c6ae35b48
dedup_wf_001::6d17ca3223407ef8f72e53cf865fed4d
dedup_wf_001::13c1e9d1de2f0482c2243be60006a19a
datacite____::75569e6cfc6d139792da0782049a42b2
dedup_wf_001::7d2b92b6726c241134dae6cd3fb8c182
dedup_wf_001::18def9f1e15f5c8cb3f88ce32d0388f4
dedup_wf_001::75822634893f55d9a2005bed4c612f1c
datacite____::f6b893e834147be4b637a0d7443ff9c7
datacite____::1d3c862887c18173a8fe037be9bdabee
datacite____::ea1bee9747f7279685714f5be8421c14
r38d07aef7b7::df877f3865752637daa540ea9cbc474f
r38d07aef7b7::eb2538078fc0e47beef6c4bd5188c471
datacite____::efc3049f89651aaa1edc882e65e203fd
datacite____::7b86fb6d10a17a0fe17722b8011cd71c
datacite____::4d0f219162004041ff0733e9a65cb007
datacite____::be1b6cf4ac2d7984282389904d727955
dedup_wf_001::75db916b377e3165bf35abb397a7cfac
dedup_wf_001::4fc28826fe37fa34663f221c2bab937b
r38d07aef7b7::1e5afae270de728fd14f20133233d33a
r38d07aef7b7::2fc02e925955d516a04e54a633f05608
r38d07aef7b7::471c75ee6643a10934502bdafee198fb
r38d07aef7b7::8e68c3c7bf14ad0bcaba52babfa470bd
r38d07aef7b7::9ed017d7372360c256add7a8fe35a0a6
r38d07aef7b7::ecaeafc0832340d4da28bd0370c03094
dedup_wf_001::5490e4f6955c4b2a8f2763bbe336ddf5
r38d07aef7b7::cfecdb276f634854f3ef915e2e980c31
datacite____::d48abd8596cb55452f8c07aedb7c3b23
dedup_wf_001::8964b6fed0c6a0ec12da82220a30acbc
dedup_wf_001::23700fe6da220c7f55d656d0683d0738
dedup_wf_001::61971e77205538fca2c4881aef59479f
datacite____::6c53efd0ebf25715dc9e6fb3d6d687b5
datacite____::6bf835676954787dca207ebb300273d5
dedup_wf_001::146e8732caf21f3894e93086140512a3
dedup_wf_001::13ef8ed30b64b2b14db548c261c8c883
dedup_wf_001::58268887d013c9f1673fdad95024bbe6
dedup_wf_001::9198c75f61dcf56b585072b7688505e4
dedup_wf_001::8ae7733f9bc11275e8d0a0fdabe5be0a
dedup_wf_001::e2c37302cbfdb63fb34daac0c6dbbd79
dedup_wf_001::b7d26731ec66915057fa34708db628bf
datacite____::2fa2238a2b3be2be73178f81a1efce8e
datacite____::4ea26001033ecb866ac1511a8046a39d
datacite____::e985b68394e01237631e34a732768c7e
datacite____::0cb90d01629c53cecee9432e990729a6
dedup_wf_001::8ad0f4cc8e986e45bf8f1b2b4e7d8600
datacite____::47ecfe3a0c954e4073b426c5fd3230fd
datacite____::7288e3745d519df376d0fbb61bad4db4
dedup_wf_001::242b9b4c4271fe4d56d4039dbf41572d
datacite____::7d8ec8ffe3244c8bea4af57bbb37e938
datacite____::d6fc7b2b7ad0f3903417a455a0ace9c6
dedup_wf_001::b093e2197a49e076c37add66d52102cf
dedup_wf_001::3e4aa8e05e65178e26aed32291ca1d7c
dedup_wf_001::e627b271e4f233a60a896e9ffb3a175b
datacite____::eb6f3ae189d959021b29a498eadca680
r38d07aef7b7::ac2d43ef3f26cc74de242202e822ecb0
datacite____::9f1e3177c8f870b5feb19ddad2da6762
dedup_wf_001::1de2543a72800e2e61fd582da3a752f6
dedup_wf_001::033983e84a082f5ad40b077d3e6edfbd
dedup_wf_001::289c30baffae946d976d4d7777bd44c7
dedup_wf_001::5e461fd134d7a62ccb1c7f413a3028d8
r38d07aef7b7::19b5f0dd9d71b2003189f2d35a7c89d1
r38d07aef7b7::98afdcc1ebd85daa0f1749c5e56b9d8c
r38d07aef7b7::a8ecbabae151abacba7dbde04f761c37
datacite____::f0108c21bd777755b87c787f598a84c2
dedup_wf_001::b22890f15a401a1b855ceedf98b785fe
dedup_wf_001::14652db923ec15adc21191dfbfb70b2e
dedup_wf_001::648a984e9adcde2fa868a0f8bd36ab6d
r38d07aef7b7::54f5f4071faca32ad5285fef87b78646
r38d07aef7b7::7c1192a2afb55fdee2a326ef8de8a3a0
r38d07aef7b7::f5e083092550d2f93898e9829e677e39
dedup_wf_001::4bcc9b7fafa01106a128cfad3cebcf16
dedup_wf_001::59ad2d62dd9fb49c02416cd78ddf0beb
datacite____::994637be6a83574e514d98326f6e69c2
datacite____::a3c6fb4b7dd5f6dbdd7cc479a194503e
dedup_wf_001::d88dae8a9ffdc1e2290ddf1c3658c0a8
dedup_wf_001::466ad710a1f835f157d6f375efb4434b
dedup_wf_001::1ffe4b75670b34433401171b486b787c
dedup_wf_001::02500a38cc6522f4d58ec42959990abe
r38d07aef7b7::a97da629b098b75c294dffdc3e463904
r38d07aef7b7::b2f83c409ce63012229fb9cd465bdcfe
r38d07aef7b7::53e232bcc4a6386499454667194addd1
r38d07aef7b7::621fbd17da27241c58015eabe4164a52
r38d07aef7b7::416849da96fb73bee793e2bf65ae43ac
r38d07aef7b7::74db120f0a8e5646ef5a30154e9f6deb
r38d07aef7b7::976f3d77e359f934970e7287f2318116
r38d07aef7b7::c4c65c2e1f678ba44aa520651fee3941
datacite____::ee23dc27e33651a38376e8b66fe93553
datacite____::175b852967276c0c1d7d92885ec98523
datacite____::408651b0fe3138755e69ff86c6040f70
datacite____::3b8cb72745e9715e6db6e13202d762d7
dedup_wf_001::6b18886bc278247582704943f5c66eb9
dedup_wf_001::732ecd34115f8996fb9fa4ae613f89ba
r38d07aef7b7::0d0fd7c6e093f7b804fa0150b875b868
r38d07aef7b7::ea9c07b5d0be2b8ee4631ee110f97fb4
r38d07aef7b7::f21e255f89e0f258accbe4e984eef486
dedup_wf_001::1e52b1356d89b333af2e7cc9414894c5
dedup_wf_001::705227e50c215ba45f4f457882c90607
datacite____::51b1a85f5cbfe2cea1449a654f857999
datacite____::62c2aafd182bd92e51b897bfec9a8732
dedup_wf_001::3d064e0a5c576bd7f01f50a434e59c81
datacite____::be28173838ccc613b589e35e30d9cd1d
r38d07aef7b7::bcbe3365e6ac95ea2c0343a2395834dd
r38d07aef7b7::e103d1ed1d6c41b0f098ff377dde2966
r38d07aef7b7::e2c61965b5e23b47b77d7c51611b6d7f
dedup_wf_001::016ad0c411c1a571ecc34b1addd78c4e
datacite____::0b4c9e61bd83b8248dde2e8108a1d2a6
r38d07aef7b7::68af34529bb4ab8575457d9e16801849
dedup_wf_001::8b2091796967c6935a7bdfde87fe7604
datacite____::cd7e06ba3f1c0dc47b7bb7c1a6deb2c9
r38d07aef7b7::05a5cf06982ba7892ed2a6d38fe832d6
r38d07aef7b7::13e5ebb0fa112fe1b31a1067962d74a7
dedup_wf_001::28aea90b6334886746b6bcd368670b2b
dedup_wf_001::ce98a19af0a0f0c2b1b402f9ca0706e7
datacite____::76549c6c590b06a61dd1e893c5ae9637
datacite____::5c0f10a6bcea9e4003232056a5cc71f7
dedup_wf_001::3c983381665c92b6082a37cd7f7752b0
datacite____::61e5dfd8d039cf15772ab5255db65443
datacite____::869834d55887ca42da79370b0904b8e4
dedup_wf_001::79a49b3e3762632813f9e35f4ba53d6c
datacite____::75c17f22977302fcbefb8e6104a68932
datacite____::cd4572f28d7940bad945fb63f4a460b7
dedup_wf_001::9843a745d90a5a55cb0039aadeea32c0
r38d07aef7b7::295029833128d5e7b4f965599342d793
r38d07aef7b7::ecd62de20ea67e1c2d933d311b08178a
dedup_wf_001::348554ac0daf5675fa4e45f9b6c71006
dedup_wf_001::21cf1e2c7605ae77ececeed18a7e2c96
dedup_wf_001::9e46f1febae5c757b00c5d088ce85825
dedup_wf_001::68c694de94e6c110f42e587e8e48d852
dedup_wf_001::4fbeacd6aeb8b09cbb453009783148f9
dedup_wf_001::50032b5dc686cd7a73d7b71794ac21e6
r38d07aef7b7::36a1694bce9815b7e38a9dad05ad42e0
r38d07aef7b7::dd939412d661b27a92e611a89e977f0a
dedup_wf_001::0c4313664052e64896fc623e29bccb87
dedup_wf_001::560421d76f9da92cfca0742842f2d1ed
datacite____::326991d241a3450480a0cafe60cb69da
dedup_wf_001::0111809bba9c15ce87713981bd8201e0
dedup_wf_001::144375bda25220a19494870a020d60ef
dedup_wf_001::04112144566d77d75f935b26101dd71d
datacite____::d070a076bb93165ccf3f8f1c767bb855
r38d07aef7b7::f84d465177e84bb4e756a8319443cdcb
dedup_wf_001::87ec2f451208df97228105657edb717f
dedup_wf_001::4ec405bc6270f314eadb139c14099a4b
dedup_wf_001::93d6db6c098728cb60cdbdd2567150ab
r38d07aef7b7::58ca56b8d08b89f6972767847e087c72
r38d07aef7b7::61fb56acb88a66651048b4b2086d5b5a
r38d07aef7b7::8e987cf1b2f1f6ffa6a43066798b4b7f
r38d07aef7b7::b8a6550662b363eb34145965d64d0cfb
r38d07aef7b7::a8abb4bb284b5b27aa7cb790dc20f80b
r38d07aef7b7::018dbfb5fec8d864714ede49cef50343
datacite____::db81fa74a252ff64106cd90db5b4003c
r38d07aef7b7::ca0daec69b5adc880fb464895726dbdf
dedup_wf_001::081af284f65bf4aa9e0c90c0ab2137bf
datacite____::02da3d8c6b76dfc6113deb31be56046d
datacite____::17ffffe91dc151c24d308225e955b579
dedup_wf_001::2173f0f840665f35051c72fff8137ef2
dedup_wf_001::4e6f583a702f7d015ca5439403429535
datacite____::a15ab1619aba7071dd86ff49cd00974e
dedup_wf_001::29f08fe3ea82326fc67685c9a8cb9909
dedup_wf_001::08040837089cdf46631a10aca5258e16
dedup_wf_001::32c8c34d2691ea14db86416811a29726
dedup_wf_001::43c183b75768df57ede4d6b5361e9311
dedup_wf_001::7f9f1c8d90c069f16dc638b529ba03ba
dedup_wf_001::87fea017dd924d0be1eb71951e50148f
dedup_wf_001::e7b125dae1dc6d70a6b4c46e42800ae4
dedup_wf_001::e8cd13ac4246ea90c1dbb5aed941c8ed
dedup_wf_001::9654f1faed5e2011c2ce76eecfb76325
dedup_wf_001::102b91e75544875f2a482fe6f9fe18b6
dedup_wf_001::a5f7b3e3fdf4559cebd36f8fc57adf16
r38d07aef7b7::65c57c59b1c396fb0bee33d21a7fe822
r38d07aef7b7::e69cf84ed41fbe71985972c027190b49
dedup_wf_001::c470366225ec05683c4306d5eb22762c
datacite____::eb40f91c891e0f366e4905091ac254c6
datacite____::1c0a8ef7fbd0c4476e06f6274487072b
dedup_wf_001::166171ebacbd5235960b5d8faf4437f7
dedup_wf_001::ad36ca1e07a86a645286e275594b9e43
dedup_wf_001::03afdbd66e7929b125f8597834fa83a4
r38d07aef7b7::704afe073992cbe4813cae2f7715336f
dedup_wf_001::bee7df60a52591ffcb299458de260512
dedup_wf_001::55f57b40ca8a55340f65920258e213fe
dedup_wf_001::9f823045742f913d36ae4d9a9f0d75ad
dedup_wf_001::34b6467a7039bd0e8aa5f6983aa303b0
dedup_wf_001::654897e28023b9d57f35dc4424b892fb
r38d07aef7b7::46d09c503b30980ffc325cc243e1c0f5
r38d07aef7b7::63ceea56ae1563b4477506246829b386
r38d07aef7b7::ceff40a7fe4e78d7c988ed83759f7d91
datacite____::710404f715306843504c991c3460bc0e
dedup_wf_001::06b6c25e06b0a7fbb9bfa0aeccaf34b1
datacite____::004922adb51caeec5c859eb58568fbd7
dedup_wf_001::bd5c5e1c04111451ed8b63079ea181e7
dedup_wf_001::90aef91f0d9e7c3be322bd7bae41617d
datacite____::d3214ebb0ff294f713a2a9cac8695f6b
dedup_wf_001::b69c16d99962a7bb5c3d97761ba9727c
dedup_wf_001::30547fb3599ed51e4f075ba9c753c8c3
dedup_wf_001::944fd3693dacd39280bf4c651c2a149e
dedup_wf_001::951b9abf25fef051c0486f49a1d44ee5
r38d07aef7b7::2b38c2df6a49b97f706ec9148ce48d86
r38d07aef7b7::4a11654ad1e1e48352252859ff3032a0
r38d07aef7b7::5bd7ddf87f22021a5f5d682ce5f93ad6
datacite____::48dfb0cf4e383428f5dc2a6763d51782
dedup_wf_001::51d3a6f35b8dfe611ff24214c8ef79d1
dedup_wf_001::f939c4bc508ae7ffe02769f71e883a3f
datacite____::4946a918e5703257dade63f00c21e0fc
r38d07aef7b7::4d771504ddcd28037b4199740df767e6
dedup_wf_001::539d6630cf6e6141a0cde76b7d24adb5
dedup_wf_001::59fcb3d5af85bc0ac8ae2de7fcec84ee
dedup_wf_001::7bca78fcdba29dd12d74ba20a5afb058
r38d07aef7b7::803dddd7ea91e91ff16610f6c8009355
r38d07aef7b7::9308b0d6e5898366a4a986bc33f3d3e7
r38d07aef7b7::bd1354624fbae3b2149878941c60df99
r38d07aef7b7::ebcd0fb1c44b0ed07842254daec4c3cc
datacite____::a07a322113e29c41fb87367cdeca13b5
dedup_wf_001::0405f859a58c10eb6d646b9f31c569a5
dedup_wf_001::0e87de457be3fffbc57408200d762452
dedup_wf_001::617b469f86bdb00a998cba5afdb2f8ce
datacite____::ae0b73e683137bb5f1eb97c2ca92d6f4
dedup_wf_001::2e1968be4ca3bf9143cc09de21bb7015
dedup_wf_001::411ae1bf081d1674ca6091f8c59a266f
datacite____::9af7f9365530f0e0f90f59efc383d0b0
r38d07aef7b7::2aa9c1afdc1323b9c19b35a4a09b989b
datacite____::3655732d9d09342fa1de8b0bd5f92614
datacite____::3b27b78d0996506638ba6f5463516bb2
dedup_wf_001::f8e6db8fcc7e428005cf296ea2d7e8eb
dedup_wf_001::026947ba375f344b921f7c825ec784c1
dedup_wf_001::22e90e50fd9480d9f0e6142740238040
dedup_wf_001::5fec4bb494d06947cad115993fc92794
dedup_wf_001::18e39d0ce826acbaba5877d4eaa3857d
dedup_wf_001::8e62985e4b2d75cb6e07e5cd2a71006a
dedup_wf_001::4c5ae0c7f73c189ea164f56f7aee8878
dedup_wf_001::c80d9d47dd687f273bf375d81e6aa393
dedup_wf_001::a714695bf9f489d1bad29c5aff5acadc
datacite____::886e4c4c825678a89a97c0a2f139affe
r38d07aef7b7::1b9812b99fe2672af746cefda86be5f9
dedup_wf_001::ef0178ca5c3bfb727a260fe1f2802595
dedup_wf_001::96445d7c7343f3aac48e05721bd4d5a4
dedup_wf_001::925c923915874d0820ac71d1a05ce30b
dedup_wf_001::0a9c4bbc33e3c6cb716683c1fdcfa388
dedup_wf_001::fdabfa58ee681c90b867cdacc5cf0efc
dedup_wf_001::0e087ec55dcbe7b2d7992d6b69b519fb
dedup_wf_001::692672a7737fd732204b9233245c22ac
datacite____::e23d1462f95afef4e05be178ab20ccd9
dedup_wf_001::1cb03f96b2b4f9015c5abe7f331eb12f
dedup_wf_001::eb28cc74968e585054a4d577ea04d80b
dedup_wf_001::44e9fb1fc36212c86eff5e797d123d7a
r38d07aef7b7::19bc916108fc6938f52cb96f7e087941
r38d07aef7b7::f5701b023d76d7b269d43e06c4a879bd
dedup_wf_001::1cecc7a77928ca8133fa24680a88d2f9
dedup_wf_001::444eb4f45f208a9a73e3739f4797ca83
dedup_wf_001::285ab34d956801aa940bb44874d1b54d
dedup_wf_001::9e3b42157b99a2359b1a90ba92835702
dedup_wf_001::e908c567ab9a95320696d05c1e7cdc58
r38d07aef7b7::4191d9903a4cf9f293dbbbff63f119c4
datacite____::21886d4c09205efd26436813f063a80b
dedup_wf_001::764f113b98a4cb6509c0a1b76c25d000
dedup_wf_001::246fb7bbab5a23a9f0bc79b9853433bc
dedup_wf_001::ad4fcd9edae2b298b9b53e4f105db1ce
r38d07aef7b7::a3a92e719349dda06de72dac3448e149
dedup_wf_001::3503afb89f7576c2518b2194382c9a23
dedup_wf_001::069203996808cec2bcf9e33b4908f9f4
r38d07aef7b7::cc9b3c69b56df284846bf2432f1cba90
dedup_wf_001::392eb1b988bc2beaacc2b67cbcf9a58d
datacite____::97f37d29f1bf06a33595ac89d151c7d6
dedup_wf_001::9a508fc488295a9c3869445ebe0c284b
datacite____::35d174fb24b75fb44e7868f1ddaeec7f
dedup_wf_001::536ae40810ea1b3ff1237dd4c6e23712
r38d07aef7b7::8e60cfb63ef8bedd98f6868c6accf1c2
datacite____::b66fb9de12cab6a729b9bd4290945707
r38d07aef7b7::0307fec2cef6aec340b8426490977ef0
r38d07aef7b7::a1afc58c6ca9540d057299ec3016d726
r38d07aef7b7::bd9e928c0f0fba89b5c8254bef1f9937
r38d07aef7b7::ec8ce6abb3e952a85b8551ba726a1227
datacite____::6a03aadb1fbf114b543aadcfc9c8d788
dedup_wf_001::e78a21bbab530f289b3691ed174d6924
dedup_wf_001::20a52a3b4dfef95631fec59da40d36c7
dedup_wf_001::5f9af18db2421d982fdfe3cb5458f90c
dedup_wf_001::c55d22f5c88cc6f04c0bb2e0025dd70b
r38d07aef7b7::46ba9f2a6976570b0353203ec4474217
r38d07aef7b7::f4573fc71c731d5c362f0d7860945b88
dedup_wf_001::c90523b43fa2691f243148f6edd965d6
datacite____::89aa6aa644c692fb2368b9078e4cfe15
dedup_wf_001::b58ca2e39e94f994f0d8eaad788de687
dedup_wf_001::1f64e2558ab55f2618b7c651332bf101
dedup_wf_001::6d4bcfa605eacb74a48e2a0a871be965
datacite____::f15019f80cda9351d959741140dd7f42
r38d07aef7b7::097e26b2ffb0339458b55da17425a71f
r38d07aef7b7::60e2126ffb2e2246df6c57b3797f2b48
r38d07aef7b7::4704d0a8754f7cd9619e6a8fab4c1021
r38d07aef7b7::c23da4fc9c3c0a2322caf4fa66762d78
r38d07aef7b7::fff23c80b2468e9402716e56f083ebc8
datacite____::2d218a0ab2e02e726f5e3278760b4fed
datacite____::44203d4bcf1b2a1ca77e2dd7e30b8f0e
datacite____::e2bfbbd0d9f7d21713560afa1261392d
datacite____::2df2fd359b8c784d4839b8b3d709c474
datacite____::a68d51b3b7b4ee0d7488f3b39cc73ec5
dedup_wf_001::33ceb07bf4eeb3da587e268d663aba1a
dedup_wf_001::06b5a77135669a22b90a089424743b9f
dedup_wf_001::1c10e23137bab2e90f24434c803301dc
dedup_wf_001::16d2db4f9c9f181c83c5ce5271e06429
dedup_wf_001::16ec018b232ff81024ae554497ea68c3
datacite____::66c1315f7576edda79dcce364f1ffa78
r38d07aef7b7::02522a2b2726fb0a03bb19f2d8d9524d
r38d07aef7b7::f542eae1949358e25d8bfeefe5b199f1
dedup_wf_001::d743be2e035291bcec7abde6ade28cd3
dedup_wf_001::0045efa3a3907101711325a75e00db21
r38d07aef7b7::995665640dc319973d3173a74a03860c
r38d07aef7b7::9ca90593821a015f234e9a8195ae5582
r38d07aef7b7::a1d0c6e83f027327d8461063f4ac58a6
datacite____::343eb140e814a0515e112d89ad1a676d
dedup_wf_001::958adb57686c2fdec5796398de5f317a
dedup_wf_001::524f141e189d2a00968c3d48cadd4159
dedup_wf_001::0cbb516d7d00fa1273df1bc2a768a055
dedup_wf_001::4837ba5cd49c7f03caaa423049e66daf
r38d07aef7b7::9188905e74c28e489b44e954ec0b9bca
dedup_wf_001::28aaad2dc33d001931d6c1fc41d62d20
dedup_wf_001::008f0dc9bada25da32d1b92f19e552fb
dedup_wf_001::bcf5c3115d32f7bd9e840e10725aed81
datacite____::c49ffce94b82f8faac50d14f1307079f
dedup_wf_001::5265a86a57edd3bd8507590fa1a08513
datacite____::781474be5f3c71eb6f29cf8877d6a622
datacite____::8d2484a1828e506d7d0c2f8eae9885e7
dedup_wf_001::5ecf9fff829b9cafb060d99880acf7d9
dedup_wf_001::4d73d1729595f9b09fd5d216b95a13da
dedup_wf_001::315e873e9b08aa9238a2e74ef9f731a9
r38d07aef7b7::10a5ab2db37feedfdeaab192ead4ac0e
r38d07aef7b7::999028872cfff7ae8ee330a33cbd3874
r38d07aef7b7::fc1dc4549df0335d7f506edb5d66af16
dedup_wf_001::09b0d2d491b0e7943dc4bd180135dcb0
r38d07aef7b7::468cbac056133a996283cca7e2976336
r38d07aef7b7::49ca03822497d26a3943d5084ed59130
r38d07aef7b7::50a074e6a8da4662ae0a29edde722179
dedup_wf_001::e17705877fd904fdef65cc892e4e5a6b
dedup_wf_001::58ecc182286e496e4702f13edc491e47
dedup_wf_001::6c5bbaa133d30e1df4ac9ee7d0612f20
r38d07aef7b7::4f16c818875d9fcb6867c7bdc89be7eb
r38d07aef7b7::5d4ae76f053f8f2516ad12961ef7fe97
r38d07aef7b7::72a36e8158ffceef8dc28aae2880f440
r38d07aef7b7::b2abed343c4faf5d13c585bbf2429538
r38d07aef7b7::cf1f78fe923afe05f7597da2be7a3da8
dedup_wf_001::4a94b9bd5b76175b0571e7e9a0336c3f
dedup_wf_001::b84e97b375927518383947b0df7b84e6
dedup_wf_001::0c76a80b0d066e623aaad081aea57c8a
r38d07aef7b7::73c4fa58d428d52c2b12e11f3b28e8f5
r38d07aef7b7::9bf521d44fb17b1abc961ebc08c28de0
r38d07aef7b7::f804d21145597e42851fa736e221da3f
dedup_wf_001::3dff03d96f8f6d9d22d64c26a08e71f0
dedup_wf_001::6f10b5c4a55511c343a652b3fbb61b29
datacite____::65ce3001e1461d7981931bbf3c896c69
r38d07aef7b7::1e4bb9409318b6d3303cb415a351674f
dedup_wf_001::3326718c02f10324bec8f90abe36005f
dedup_wf_001::2e0912ea453bee6ca8861f354fb1a491
datacite____::a0d22241c4b116f8b8bf843efde82ce5
datacite____::a1d6e092567d8ed9bea4d16b49cb465a
dedup_wf_001::1c1d4df596d01da60385f0bb17a4a9e0
datacite____::84917c8ad5f0284aa465223546e07348
r38d07aef7b7::48f7d3043bc03e6c48a6f0ebc0f258a8
r38d07aef7b7::641e167f974d1dd076c0886d17271975
dedup_wf_001::49ef08ad6e7f26d7f200e1b2b9e6e4ac
dedup_wf_001::0a3013c44b99f3a47d5c9462bf6b31bf
datacite____::5dc48b7ac45c2422192fea9a9b3dc525
datacite____::fcb29b7a9a5d3b53a7a71557457ebdd9
r38d07aef7b7::354680832fcea7e2b7057a5ac2c489f8
r38d07aef7b7::9a1158154dfa42caddbd0694a4e9bdc8
r38d07aef7b7::1c54985e4f95b7819ca0357c0cb9a09f
r38d07aef7b7::4a4b16d454ca9f9075c129f6a0384d3d
r38d07aef7b7::95151403b0db4f75bfd8da0b393af853