,description,datatype,comp_name,comp_type,subtitle,EvaluationAlgorithmAbbreviation,data_sources 0,"'`When an employee at any company starts work, they first need to obtain the computer access necessary to fulfill their role. This access may allow an employee to read/manipulate resources through various applications or web portals. It is assumed that employees fulfilling the functions of a given role will access the same or similar resources. It is often the case that employees figure out the access they need as they encounter roadblocks during their daily work (e.g. not able to log into a reporting portal). A knowledgeable supervisor then takes time to manually grant the needed access in order to overcome access obstacles. As employees move throughout a company, this access discovery/recovery cycle wastes a nontrivial amount of time and money. There is a considerable amount of data regarding an employee's role within an organization and the resources to which they have access. Given the data related to current employees and their provisioned access, models can be built that automatically determine access privileges as employees enter and leave roles within a company. These auto-access models seek to minimize the human involvement required to grant or revoke employee access. Objective The objective of this competition is to build a model, learned using historical data, that will determine an employee's access needs, such that manual access transactions (grants and revokes) are minimized as the employee's attributes change over time. The model will take an employee's role information and a resource code and will return whether or not access should be granted. 
Partners This competition is hosted in collaboration with the IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2013)`'",tabular data,Amazon.com - Employee Access Challenge,featured,"Predict an employee's access needs, given his/her job role",AUC,amazon.com-employee-access-challenge 1,"'`The Game of Life is a cellular automaton created by mathematician John Conway in 1970. The game consists of a board of cells that are either on or off. One creates an initial configuration of these on/off states and observes how it evolves. There are four simple rules to determine the next state of the game board, given the current state: Any live cell with fewer than two live neighbours dies, as if by underpopulation. Any live cell with two or three live neighbours lives on to the next generation. Any live cell with more than three live neighbours dies, as if by overpopulation. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction. These simple rules result in many interesting behaviors and have been the focus of a large body of mathematics. As Wikipedia tells it, Ever since its publication, Conway's Game of Life has attracted much interest, because of the surprising ways in which the patterns can evolve. Life provides an example of emergence and self-organization. It is interesting for computer scientists, physicists, biologists, biochemists, economists, mathematicians, philosophers, generative scientists and others to observe the way that complex patterns can emerge from the implementation of very simple rules. The game can also serve as a didactic analogy, used to convey the somewhat counter-intuitive notion that ""design"" and ""organization"" can spontaneously emerge in the absence of a designer. 
For example, philosopher and cognitive scientist Daniel Dennett has used the analogue of Conway's Life ""universe"" extensively to illustrate the possible evolution of complex philosophical constructs, such as consciousness and free will, from the relatively simple set of deterministic physical laws governing our own universe. The emergence of order from simple rules raises an interesting question--what happens if we set time backwards? This competition is an experiment to see if machine learning (or optimization, or any method) can predict the game of life in reverse. Is the chaotic start of Life predictable from its orderly ends? We have created many games, evolved them, and provided only the end boards. You are asked to predict the starting board that resulted in each end board. Although some people have examined this problem, it is unknown (at least, to us...) just how difficult this will be.`'",tabular data,Conway's Reverse Game of Life,playground,Reverse the arrow of time in the Game of Life,MAE,conways-reverse-game-of-life 2,"'`CIFAR-10 is an established computer-vision dataset used for object recognition. It is a subset of the 80 million tiny images dataset and consists of 60,000 32x32 color images containing one of 10 object classes, with 6000 images per class. It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Kaggle is hosting a CIFAR-10 leaderboard for the machine learning community to use for fun and practice. You can see how your approach compares to the latest research methods on Rodrigo Benenson's classification results page. Please cite this technical report if you use this dataset: Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.`'",image data,CIFAR-10 - Object Recognition in Images,playground,"Identify the subject of 60,000 labeled images",CategorizationAccuracy,cifar-10-object-recognition-in-images 3,"'`A good chocolate soufflé is decadent, delicious, and delicate. But it's a challenge to prepare. 
When you pull a disappointingly deflated dessert out of the oven, you instinctively retrace your steps to identify at what point you went wrong. Bosch, one of the world's leading manufacturing companies, has an imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. Part of doing so is closely monitoring its parts as they progress through the manufacturing processes. Because Bosch records data at every step along its assembly lines, they have the ability to apply advanced analytics to improve these manufacturing processes. However, the intricacies of the data and complexities of the production line pose problems for current methods. In this competition, Bosch is challenging Kagglers to predict internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable Bosch to bring quality products at lower costs to the end user.`'",tabular data,Bosch Production Line Performance,featured,Reduce manufacturing failures,MatthewsCorrelationCoefficient,bosch-production-line-performance 4,"'`Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. 
With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.`'",tabular data,Rossmann Store Sales,featured,"Forecast sales using store, promotion, and competitor data",RootMeanSquarePercentageError,rossmann-store-sales 5,"'`Instead of waking to overlooked ""Do not disturb"" signs, Airbnb travelers find themselves rising with the birds in a whimsical treehouse, having their morning coffee on the deck of a houseboat, or cooking a shared regional breakfast with their hosts. New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand. In this recruiting competition, Airbnb challenges you to predict in which country a new user will make his or her first booking. Kagglers who impress with their answer (and an explanation of how they got there) will be considered for an interview for the opportunity to join Airbnb's Data Science and Analytics team. Wondering if you're a good fit? Check out this article on how Airbnb scaled data science to all sides of their organization, and visit their careers page for more on Airbnb's mission to create a world that inspires human connection.`'",tabular data,Airbnb New User Bookings,recruitment,Where will a new guest book their first travel experience?,NDCG@{K},airbnb-new-user-bookings 6,"'`One of the biggest challenges for an auto dealership purchasing a used car at an auto auction is the risk that the vehicle might have serious issues that prevent it from being sold to customers. The auto community calls these unfortunate purchases ""kicks"". 
Kicked cars often result when there are tampered odometers, mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. Kicked cars can be very costly to dealers after transportation costs, throw-away repair work, and market losses in reselling the vehicle. Modelers who can figure out which cars have a higher risk of being kicked can provide real value to dealerships trying to provide the best inventory selection possible to their customers. The challenge of this competition is to predict if the car purchased at the Auction is a Kick (bad buy).`'",tabular data,Don't Get Kicked!,featured,Predict if a car purchased at auction is a lemon,Gini,dont-get-kicked! 7,"'`Homework 2 Please check the slides for details.`'",image data,AI for Clinical Data Analytics HW2,inClass,Computed Tomography Lung Tumor Segmentation Task,dice,ai-for-clinical-data-analytics-hw2 9,"'`Cloud Faculty Institute Workshop (6/22/2018) Predict Diabetes from Medical Records Kaggle InClass Competition Kaggle Kernels`'",tabular data,Cloud Faculty Institute Workshop,inClass,Predict Diabetes from Medical Records,meanfscore,cloud-faculty-institute-workshop 10,"'`Overview The dataset consists of 332987 Chinese characters (1K classes). Train the best classifier based on a CNN to outperform other participants.`'",image data,Characters classification,inClass,Chinese characters classification problem,categorizationaccuracy,characters-classification 11,'`In this competition, street signs will be classified.`',image data,ClassificationOFShields,inClass,HSMAWS18/19 Experiment 2,categorizationaccuracy,classificationofshields 12,"'`Introduction Dear students, welcome to the third assignment of PMR 3508! In the last two assignments, we focused on classification tasks. This time, we will work on a regression problem! Objective In this assignment, you have to build a model that predicts the median price of a house in a region of California. 
The complete description of the available data can be found on the ""Data"" page. The goal of this assignment is for you to test the full range of regression models you have learned so far in the course - KNN, LASSO, RIDGE, decision trees, among others - to see which one is best suited to this task. The idea is for you to experiment freely with the algorithms and get to know them better, understanding how their hyper-parameters work and learning a bit about the bias-variance tradeoff. For some ideas of regression algorithms to use, I recommend taking a look at: http://scikit-learn.org/stable/supervised_learning.html Evaluation As in the other assignments, YOUR GRADE WILL BE BASED SOLELY ON THE SUBMITTED KERNEL - students who do not submit their kernels to the competition will have their grades zeroed, regardless of how well they did on the Leaderboard. We also remind you that THE KERNEL MUST BE SET TO PUBLIC so that the grader can access it to mark it - kernels published as private and only changed to public after grading will have points deducted. Scoring The grades for this assignment will be awarded as follows: 1- Completion) 10 points will be awarded to students who complete the assignment and clearly present a study of at least 3 different types of regressor 2- Organization) 1 extra point will be awarded to students whose notebooks are well organized and clearly explained 3- Exploration) 1 extra point will be awarded to students who perform a deep exploration of the existing data, with rich visualizations and qualitative analyses of the observed data 4- Feature Engineering) 2 extra points will be awarded to students whose feature engineering is creative. Although there are few features about each location, several complementary features can be created to assist the regression process, such as the average number of rooms per house or the average number of people per room, among others. In addition, the geographic coordinates of each location are given. 
Can you create some creative variable from this, such as the distance from the location to the sea or the distance from the location to one of the hubs of Silicon Valley? Be creative! For the curious, the geopy library (https://github.com/geopy/geopy) can be very useful at this stage. Competition To give this competition a more competitive character, as in the first one, the student who takes first place on the prediction leaderboard will win a 100g Lindt bar. Acknowledgements The data for this competition was taken from http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html, with special thanks to professor Luís Torgo for making it publicly available.`'",tabular data,Atividade_3_PMR3508,inClass,Third assignment of the PMR3508 course: regression.,rmsle,atividade_3_pmr3508 13,"'`In the Global Entrepreneurship Summit, Chh-OLA, the taxi-hailing startup led by Khanchandani, Harsh and Dwivedi, garnered a lot of interest and funding from the investors. Five years later, Chh-OLA is successfully running a number of taxis on the streets of New Delhi and garnering significant profit. Chh-OLA has recently opened a Data Science division, and you have been recruited into it. As part of your training, Shruti, the head of the Data Science division, has given you the task of estimating the total fare amount of trips using various trip parameters such as distance, passenger count, etc. Can you fulfil this task given by your head?`'",tabular data,Chh-OLA,inClass,Help the developers to predict the total fare amount of trips,rmse,chh-ola 14,'`The evaluation metric for this competition is Accuracy.`',image data,Aesthetic Visual Analysis,inClass,Image classification by quality,categorizationaccuracy,aesthetic-visual-analysis 15,"'`Demo for ACM. 
Photo by Sebastian Ervi on Unsplash.`'",tabular data,[ACM] Recommender System Practice,inClass,Recommend talent that users might like!,map@{k},[acm]-recommender-system-practice 16,'`Please refer to the 'San Francisco Crime Classification' Kaggle competition for details.`',tabular data,IIITB ML Project: SFO Crime Classification,inClass,Predict the category of crimes that occurred in the city by the bay,multiclassloss,iiitb-ml-project:-sfo-crime-classification 17,'`This competition is built for evaluating your results on the second phase of the CI course project.`',tabular data,Computational Intelligence Project,inClass,Upload your results for the second phase of the project.,categorizationaccuracy,computational-intelligence-project 18,"'`Cars are one of the most sought-after luxuries today. They are mostly admired for their sleek design and features. In this task, you are given 45 classes (cars) with 100 images each. Train a CNN on the given dataset and submit the predictions on the test set in a CSV file. Download the dataset from the Data tab and read the further instructions.`'",image data,Car Classification(Project Vision),inClass,Identify the cars in images using CNNs,categorizationaccuracy,car-classification(project-vision) 19,"'`Can you find more cat in your dat? We loved the participation and engagement with the first Cat in the Dat competition. Because this is such a common task and important skill to master, we've put together a dataset that contains only categorical features, and includes: binary features low- and high-cardinality nominal features low- and high-cardinality ordinal features (potentially) cyclical features This follow-up competition offers an even more challenging dataset so that you can continue to build your skills with the common machine learning task of encoding categorical variables. This challenge adds the additional complexity of feature interactions, as well as missing data. 
This Playground competition will give you the opportunity to try different encoding schemes for different algorithms to compare how they perform. We encourage you to share what you find with the community. If you're not sure how to get started, you can check out the Categorical Variables section of Kaggle's Intermediate Machine Learning course. Have Fun!`'",tabular data,Categorical Feature Encoding Challenge II,playground,"Binary classification, with every feature a categorical (and interactions!)",AUC,categorical-feature-encoding-challenge-ii 20,"'`This is week 4 of Kaggle's COVID-19 forecasting series, following the Week 3 competition. This is the 4th competition we've launched in this series. All of the prior discussion forums have been migrated to this competition for continuity. Background The White House Office of Science and Technology Policy (OSTP) pulled together a coalition of research groups and companies (including Kaggle) to prepare the COVID-19 Open Research Dataset (CORD-19) to attempt to address key open scientific questions on COVID-19. Those questions are drawn from the National Academies of Sciences, Engineering, and Medicine (NASEM) and the World Health Organization (WHO). The Challenge Kaggle is launching companion COVID-19 forecasting challenges to help answer a subset of the NASEM/WHO questions. While the challenge involves forecasting confirmed cases and fatalities between April 15 and May 14 by region, the primary goal isn't only to produce accurate forecasts. It's also to identify factors that appear to impact the transmission rate of COVID-19. You are encouraged to pull in, curate and share data sources that might be helpful. If you find variables that look like they impact the transmission rate, please share your findings in a notebook. As the data becomes available, we will update the leaderboard with live results based on data made available from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). 
We have received support and guidance from health and policy organizations in launching these challenges. We're hopeful the Kaggle community can make valuable contributions to developing a better understanding of factors that impact the transmission of COVID-19. Companies and Organizations There is also a call to action for companies and other organizations: If you have datasets that might be useful, please upload them to Kaggle's dataset platform and reference them in this forum thread. That will make them accessible to those participating in this challenge and a resource to the wider scientific community. Acknowledgements JHU CSSE for making the data available to the public. The White House OSTP for pulling together the key open questions. The image comes from the Centers for Disease Control and Prevention. This is a Code Competition. Refer to Code Requirements for details.`'",tabular data,COVID19 Global Forecasting (Week 4),research,Forecast daily COVID-19 spread in regions around the world,MCRMSLE,covid19-global-forecasting-(week-4) 21,"'`Predict COVID cases! The main goal is to predict the COVID-19 variable.`'",tabular data,COVID-19 diagnostic,inClass,Detect case using data,auc,covid-19-diagnostic 22,"'`That file you downloaded may contain hidden messages that aren't part of its regular contents. The same technology employed for digital watermarking is also misused by crime rings. Law enforcement must now use steganalysis to detect these messages as part of their investigations. Machine learning is an important tool in the discovery of this secret data. Current methods produce unreliable results, raising false alarms. One reason for inaccuracy is the many different devices and processing combinations. Yet, detection models are trained on a homogeneous dataset. To increase accuracy, researchers must put data hidden within digital images into the wild (hence the name ALASKA) to mimic real world conditions. 
In the competition, you'll create an efficient and reliable method to detect secret data hidden within innocuous-seeming digital images. Rather than limiting the data source, these images have been acquired with as many as 50 different cameras (from smartphone to full-format high-end) and processed in different fashions. Successful entries will include robust detection algorithms with minimal false positives. The IEEE WIFS (Workshop on Information Forensics and Security) is eager to make this happen again, as a follow-up to the ALASKA#1 Challenge. WIFS is an annual event where researchers gather to discuss emerging challenges, exchange fresh ideas, and share state-of-the-art results and technical expertise in the areas of information security and forensics. WIFS has teamed up with Troyes University of Technology, CRIStAL Lab, Lille University, and CNRS to enable more accurate steganalysis. Law enforcement officers need better methods to combat criminals using hidden messages. The data science community and other researchers can help with better automated detection. More accurate methods could help catch criminals whose communications are hidden in plain sight. The challenge is organized by Rémi COGRANNE (UTT), Patrick BAS (CRIStAL / CNRS) and Quentin Giboulot (UTT); in addition to Kaggle, we have been greatly helped by the following sponsors:`'",image data,ALASKA2 Image Steganalysis,research,Detect secret data hidden within digital images,WeightedAUC,alaska2-image-steganalysis 23,"'`Running a thriving local restaurant isn't always as charming as first impressions appear. There are often all sorts of unexpected troubles popping up that could hurt business. One common predicament is that restaurants need to know how many customers to expect each day to effectively purchase ingredients and schedule staff members. This forecast isn't easy to make because many unpredictable factors affect restaurant attendance, like weather and local competition. 
It's even harder for newer restaurants with little historical data. Recruit Holdings has unique access to key datasets that could make automated future customer prediction possible. Specifically, Recruit Holdings owns Hot Pepper Gourmet (a restaurant review service), AirREGI (a restaurant point-of-sale service), and Restaurant Board (reservation log management software). In this competition, you're challenged to use reservation and visitation data to predict the total number of visitors to a restaurant for future dates. This information will help restaurants be much more efficient and allow them to focus on creating an enjoyable dining experience for their customers.`'",tabular data,Recruit Restaurant Visitor Forecasting,,Predict how many future visitors a restaurant will receive,RMSLE,recruit-restaurant-visitor-forecasting 24,"'`In this competition, you're challenged to build an algorithm to detect a visual signal for pneumonia in medical images. Specifically, your algorithm needs to automatically locate lung opacities on chest radiographs. Here's the backstory and why solving this problem matters. Pneumonia accounts for over 15% of all deaths of children under 5 years old internationally. In 2015, 920,000 children under the age of 5 died from the disease. In the United States, pneumonia accounts for over 500,000 visits to emergency departments [1] and over 50,000 deaths in 2015 [2], keeping the ailment on the list of top 10 causes of death in the country. While common, accurately diagnosing pneumonia is a tall order. It requires review of a chest radiograph (CXR) by highly trained specialists and confirmation through clinical history, vital signs and laboratory exams. Pneumonia usually manifests as an area or areas of increased opacity [3] on CXR. 
However, the diagnosis of pneumonia on CXR is complicated because of a number of other conditions in the lungs such as fluid overload (pulmonary edema), bleeding, volume loss (atelectasis or collapse), lung cancer, or post-radiation or surgical changes. Outside of the lungs, fluid in the pleural space (pleural effusion) also appears as increased opacity on CXR. When available, comparison of CXRs of the patient taken at different time points and correlation with clinical symptoms and history are helpful in making the diagnosis. CXRs are the most commonly performed diagnostic imaging study. A number of factors such as positioning of the patient and depth of inspiration can alter the appearance of the CXR [4], complicating interpretation further. In addition, clinicians are faced with reading high volumes of images every shift. To improve the efficiency and reach of diagnostic services, the Radiological Society of North America (RSNA) has reached out to Kaggle's machine learning community and collaborated with the US National Institutes of Health, The Society of Thoracic Radiology, and MD.ai to develop a rich dataset for this challenge. The RSNA is an international society of radiologists, medical physicists and other medical professionals with more than 54,000 members from 146 countries across the globe. They see the potential for ML to automate initial detection (imaging screening) of potential pneumonia cases in order to prioritize and expedite their review. Challenge participants may be invited to present their AI models and methodologies during an award ceremony at the RSNA Annual Meeting which will be held in Chicago, Illinois, USA, from November 25-30, 2018. Acknowledgements Thank you to the National Institutes of Health Clinical Center for publicly providing the Chest X-Ray dataset [5]. 
NIH News release: NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community Original source files and documents Also, a big thank you to the competition organizers! References Rui P, Kang K. National Ambulatory Medical Care Survey: 2015 Emergency Department Summary Tables. Table 27. Available from: www.cdc.gov/nchs/data/nhamcs/webtables/2015edwebtables.pdf Deaths: Final Data for 2015. Supplemental Tables. Tables I-21, I-22. Available from: www.cdc.gov/nchs/data/nvsr/nvsr66/nvsr6606tables.pdf Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297 Kelly B. The Chest Radiograph. Ulster Med J 2012;81(3):143-148 Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR 2017, http://openaccess.thecvf.com/content_cvpr_2017/papers/Wang_ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.pdf`'",image data,RSNA Pneumonia Detection Challenge,,Can you build an algorithm that automatically detects potential pneumonia cases?,RSNAObjectDetectionAP,rsna-pneumonia-detection-challenge 25,"'`Most flight-related fatalities stem from a loss of airplane state awareness. That is, ineffective attention management on the part of pilots who may be distracted, sleepy or in other dangerous cognitive states. Your challenge is to build a model to detect troubling events from aircrews' physiological data. You'll use data acquired from actual pilots in test situations, and your models should be able to run calculations in real time to monitor the cognitive states of pilots. With your help, pilots could then be alerted when they enter a troubling state, preventing accidents and saving lives. 
Reducing aircraft fatalities is just one of the complex problems that Booz Allen Hamilton has been solving for business, government, and military leaders for over 100 years. Through devotion, candor, courage, and character, they produce original solutions where there are no roadmaps. Now you can help them find answers, save lives, and change the world.`'",tabular data,Reducing Commercial Aviation Fatalities,,Can you tell when a pilot is heading for trouble?,MulticlassLoss,reducing-commercial-aviation-fatalities 26,"'`Do you have your father's nose? Blood relatives often share facial features. Now researchers at Northeastern University want to improve their algorithm for facial image classification to bridge the gap between research and other familial markers like DNA results. That will be your challenge in this new Kaggle competition. An automatic kinship classifier has been in the works at Northeastern since 2010. Yet this technology remains largely unseen in practice for a couple of reasons: 1. Existing image databases for kinship recognition tasks aren't large enough to capture and reflect the true data distributions of the families of the world. 2. Many hidden factors affect familial facial relationships, so a more discriminant model is needed than the computer vision algorithms used most often for higher-level categorizations (e.g. facial recognition or object classification). In this competition, you'll help researchers build a more complex model by determining if two people are blood-related based solely on images of their faces. If you think you can get it ""on the nose,"" this competition is for you. The SMILE Lab at Northeastern focuses on the frontier research of applied machine learning, social media analytics, human-computer interaction, and high-level image and video understanding. Their research is driven by the explosion of diverse multimedia from the Internet, including both personal and publicly-available photos and videos. 
They start by treating fundamental theory from learning algorithms as the soul of machine intelligence and arm it with visual perception.`'",image data,Northeastern SMILE Lab - Recognizing Faces in the Wild,,Can you determine if two individuals are related?,AUC,northeastern-smile-lab-recognizing-faces-in-the-wild 27,"'`The cost of some drugs and medical treatments has risen so high in recent years that many patients are having to go without. You can help with a classification project that could make researchers more efficient. One of the more surprising reasons behind the cost is how long it takes to bring new treatments to market. Despite improvements in technology and science, research and development continues to lag. In fact, finding new treatments takes, on average, more than 10 years and costs hundreds of millions of dollars. Recursion Pharmaceuticals, creators of the industry's largest dataset of biological images, generated entirely in-house, believes AI has the potential to dramatically improve and expedite the drug discovery process. More specifically, your efforts could help them understand how drugs interact with human cells. This competition will have you disentangling experimental noise from real biological signals. Your entry will classify images of cells under one of 1,108 different genetic perturbations. You can help eliminate the noise introduced by technical execution and environmental variation between experiments. If successful, you could dramatically improve the industry's ability to model cellular images according to their relevant biology. In turn, applying AI could greatly decrease the cost of treatments, and ensure these treatments get to patients faster. This competition is a part of the NeurIPS 2019 competition track. Winners will be invited to contribute their solutions towards the workshop presentation. 
Acknowledgments Thank you to the following sponsors & supporters of this competition: Google Cloud: Google Cloud is widely recognized as a global leader in delivering a secure, open and intelligent enterprise cloud platform. Our technology is built on Google's private network and is the product of nearly 20 years of innovation in security, network architecture, collaboration, artificial intelligence and open source software. We offer a simply engineered set of tools and unparalleled technology across Google Cloud Platform and G Suite that help bring people, insights and ideas together. Customers across more than 150 countries trust Google Cloud to modernize their computing environment for today's digital world. DoiT: You have the cloud and we have your back. For nearly a decade, we've been helping businesses build and scale cloud solutions with our world-class cloud engineering support. We help our customers with technical support and consulting on building and operating complex large-scale distributed systems, developing better machine learning models and setting up big data solutions using Google Cloud, Amazon AWS and Microsoft Azure. NVIDIA: NVIDIA's (NASDAQ: NVDA) invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots and self-driving cars that can perceive and understand the world. More information at http://nvidianews.nvidia.com. Lambda: Lambda provides Deep Learning workstations, servers, and GPU cloud services. Lambda Deep Learning infrastructure is used by the world's leading AI research & development organizations including Apple, Microsoft, MIT, Stanford, and the US Government. 
To learn more, visit www.lambdalabs.com.`'",image data,Recursion Cellular Image Classification,,CellSignal: Disentangling biological signal from experimental noise in cellular images,CategorizationAccuracy,recursion-cellular-image-classification 28,"'`Intracranial hemorrhage, bleeding that occurs inside the cranium, is a serious health problem requiring rapid and often intensive medical treatment. For example, intracranial hemorrhages account for approximately 10% of strokes in the U.S., where stroke is the fifth-leading cause of death. Identifying the location and type of any hemorrhage present is a critical step in treating the patient. Diagnosis requires an urgent procedure. When a patient shows acute neurological symptoms such as severe headache or loss of consciousness, highly trained specialists review medical images of the patient's cranium to look for the presence, location and type of hemorrhage. The process is complicated and often time consuming. In this competition, your challenge is to build an algorithm to detect acute intracranial hemorrhage and its subtypes. You'll develop your solution using a rich image dataset provided by the Radiological Society of North America (RSNA) in collaboration with members of the American Society of Neuroradiology and MD.ai. If successful, you'll help the medical community identify the presence, location and type of hemorrhage in order to quickly and effectively treat affected patients. Challenge participants may be invited to present their AI models and methodologies during an award ceremony at the RSNA Annual Meeting which will be held in Chicago, Illinois, USA, from December 1-6, 2019. 
Collaborators Four research institutions provided large volumes of de-identified CT studies that were assembled to create the challenge dataset: Stanford University, Thomas Jefferson University, Unity Health Toronto and Universidade Federal de São Paulo (UNIFESP). The American Society of Neuroradiology (ASNR) organized a cadre of more than 60 volunteers to label over 25,000 exams for the challenge dataset. ASNR is the world's leading organization for the future of neuroradiology representing more than 5,300 radiologists, researchers, interventionalists, and imaging scientists. MD.ai provided tooling and support for the data annotation process. The RSNA is an international society of radiologists, medical physicists and other medical professionals with more than 54,000 members from 146 countries across the globe. They see the potential for AI to assist in detection and classification of hemorrhages in order to prioritize and expedite their clinical work. A full set of acknowledgments can be found on this page.`'",image data,RSNA Intracranial Hemorrhage Detection,,Identify acute intracranial hemorrhage and its subtypes,WeightedMeanColumnwiseLogLoss,rsna-intracranial-hemorrhage-detection 29,"'`If every breath is strained and painful, it could be a serious and potentially life-threatening condition. A pulmonary embolism (PE) is caused by an artery blockage in the lung. Confirming a PE is time consuming and prone to overdiagnosis. Machine learning could help to more accurately identify PE cases, which would make management and treatment more effective for patients. Currently, CT pulmonary angiography (CTPA) is the most common type of medical imaging used to evaluate patients with suspected PE. These CT scans consist of hundreds of images that require detailed review to identify clots within the pulmonary arteries. As the use of imaging continues to grow, constraints on radiologists' time may contribute to delayed diagnosis. 
The Radiological Society of North America (RSNA) has teamed up with the Society of Thoracic Radiology (STR) to help improve the use of machine learning in the diagnosis of PE. In this competition, you'll detect and classify PE cases. In particular, you'll use chest CTPA images (grouped together as studies) and your data science skills to enable more accurate identification of PE. If successful, you'll help reduce human delays and errors in detection and treatment. With 60,000-100,000 PE deaths annually in the United States, it is among the most fatal cardiovascular diseases. Timely and accurate diagnosis will help these patients receive better care and may also improve outcomes. This is a Code Competition. Refer to Code Requirements for details. Acknowledgments The Radiological Society of North America (RSNA) is an international society of radiologists, medical physicists, and other medical professionals with more than 53,400 members worldwide. RSNA hosts the world's premier radiology forum and publishes two top peer-reviewed journals: Radiology, the highest-impact scientific journal in the field, and RadioGraphics, the only journal dedicated to continuing education in radiology. The Society of Thoracic Radiology (STR) was founded in 1982. The STR is dedicated to advancing cardiothoracic imaging in clinical application, education, and research in radiology and allied disciplines. Continuing professional development opportunities provided by the STR include educational and scientific meetings, mentorship programs, grant support and award opportunities, our society journal, Journal of Thoracic Imaging, and global collaboration activities. 
A full set of acknowledgments can be found on this page.`'",image data,RSNA STR Pulmonary Embolism Detection,,Classify Pulmonary Embolism cases in chest CT scans,WeightedMeanColumnwiseLogLoss,rsna-str-pulmonary-embolism-detection 30,"'`Riiid AIEd Challenge 2020 Challenge Website Thank you to all those who attended the AAAI-2021 workshop on AI Education! Prize-winning teams presented their models at the AAAI-2021 Workshop on AI Education - Imagining Post-COVID Education with AI - on February 9, 2021. You can find the model write-ups on the workshop website. Think back to your favorite teacher. They motivated and inspired you to learn. And they knew your strengths and weaknesses. The lessons they taught were based on your ability. For example, teachers would make sure you understood algebra before advancing to calculus. Yet, many students don't have access to personalized learning. In a world full of information, data scientists like you can help. Machine learning can offer a path to success for young people around the world, and you are invited to be part of this mission. In 2018, 260 million children weren't attending school. At the same time, more than half of these young students didn't meet minimum reading and math standards. Education was already in a tough place when COVID-19 forced most countries to temporarily close schools. This further delayed learning opportunities and intellectual development. The equity gaps in every country could grow wider. We need to re-think the current education system in terms of attendance, engagement, and individualized attention. Riiid Labs, an AI solutions provider delivering creative disruption to the education market, empowers global education players to rethink traditional ways of learning by leveraging AI. With a strong belief in equal opportunity in education, Riiid launched an AI tutor based on deep-learning algorithms in 2017 that attracted more than one million South Korean students. 
This year, the company released EdNet, the world's largest open database for AI education, containing more than 100 million student interactions. In this competition, your challenge is to create algorithms for ""Knowledge Tracing,"" the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions. You will apply your machine learning skills to Riiid's EdNet data. Your innovative algorithms will help tackle global challenges in education. If successful, it's possible that any student with an Internet connection can enjoy the benefits of a personalized learning experience, regardless of where they live. With your participation, we can build a better and more equitable model for education in a post-COVID-19 world. Acknowledgements Academic Advisors Paul Kim, Stanford Graduate School of Education Neil Heffernan, WPI & ASSISTments Partners`'",tabular data,Riiid Answer Correctness Prediction,,Track knowledge states of 1M+ students in the wild,AUC,riiid-answer-correctness-prediction 31,"'`This competition is used as your final project for CS055241 - HCMUT. In this competition, your task is to build a machine learning model to classify images based on their training data. You are allowed to use any type of machine learning technique to complete this challenge. The result is evaluated based on the accuracy of the output from your model. Please refer to the Data page for more details about the dataset. This is an individual challenge. The final grade for each student will be based on the student's ranking. Please do not share your code and results. Have fun! Rang Nguyen`'",image data,Image Classification,inClass,Classify JAFFE images,categorizationaccuracy,image-classification 32,"'`Objective In this homework, we will train a deep learning model that can classify handwritten letters and digits in the Extended-MNIST (EMNIST) dataset. EMNIST includes 52 letters (capital and small letters) and 10 digits. 
We will use the default test dataset (emnist-byclass-test) to evaluate your model. You are free to use any neural network architecture, any DL library (PyTorch, TensorFlow, ...), or even any language (C, Java, JavaScript). This is just a warm-up homework, so don't worry about the accuracy too much. As long as your accuracy is better than a random guess, you will get the credit. Please register a Kaggle account with your student ID. Because Kaggle only accepts output results, please upload your source code to NTUT i-School. Input Data We've converted the EMNIST binary data to NumPy npz format. The code below shows how to load the data:

import numpy as np

# Load training data
data = np.load('emnist-byclass-train.npz')
train_labels = data['training_labels']
train_data = data['training_images']

# Load testing data
test_data = np.load('emnist-byclass-test.npz')['testing_images']

You are free to download the original EMNIST from its official website. However, DON'T USE ByClass Testing Images for Training! The statistics of EMNIST are shown in the table below: The By_Class split includes 62 classes comprising [0-9], [a-z] and [A-Z], while the By_Merge split has merged the uppercase and lowercase letters of C, I, J, K, L, M, O, P, S, U, V, W, X, Y and Z, and includes only 47 classes. Output Format A submission file of prediction results should be a CSV file containing two columns, Id and Category, separated by ','. The file should contain a header and have the following format:

Id,Category
0,15
1,7
2,30
3,22
...

We will use Category Accuracy as the metric. There are 62 classes (0 ~ 9 & A ~ Z & a ~ z) in the ByClass split of the EMNIST dataset, which are labeled sequentially as 0, 1, 2, ..., 61. 
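To make the output format concrete, here is a minimal sketch of writing a valid submission file (the predicted labels are invented; only the header and column layout come from the assignment):

```python
import csv

# Hypothetical predicted class labels (0-61), one per test image Id.
predicted_labels = [15, 7, 30, 22]

with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id', 'Category'])  # required header row
    for image_id, label in enumerate(predicted_labels):
        writer.writerow([image_id, label])  # Id, then predicted class
```

In a real submission there would be one row per test image, with Ids matching the order of testing_images.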
Kaggel Notebook Example You can refer to the notebook to see how to train a model on Kaggle: https://www.kaggle.com/kuantinglai/kernel172124573b References EMNIST Official Website EMNIST paper`'",image data,108-1 NTUT Machine Learning Homework 1,inClass,Classifying Handwritten Letters and Digits in Extended MNIST dataset,categorizationaccuracy,108-1-ntut-machine-learning-homework-1 33,"'`In Homework 2 we are going to train models that can recommend hashtags based on the content of an image. We will use the dataset ""HARRISON"". Please refer to the paper (https://arxiv.org/abs/1605.05054) for more details about the dataset. The original data of HARRISON can be downloaded here: (https://drive.google.com/file/d/1B9NZf42J_GpslRNlTTAxvM5c9WRVl_1Z/view?usp=sharing). You can download the images and take a look, but this is not mandatory. We already extracted the image features for you and upload to Kaggle as ""harrison_features.npz"". The demo code separates the training process into two stages: word2vec training and image-to-word2vec training. To evaluate the performance, the top 10 hashtags (KNN=10) recommended from your model will be used to calculate the hamming loss of ground truth. Get your computer ready and have fun!`'",image data,108-2 NTUT DL APP HW2 Image Hashtag Recommendation,,Recommend Most Relevant Hashtags based on the Content of Images,hammingloss,108-2-ntut-dl-app-hw2-image-hashtag-recommendation 34,"'`Learning Goal In this homework, we choose ""FrozenLake8x8-v0"" as our learning environment. The goal of FrozenLake is to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoid holes (H). However, the ice is slippery, so you won't always move in the direction you intend (stochastic environment). 
Observation Type: Discrete(64). States (observations) are numbered 0 - 63, counting each position of the 8x8 square from left to right, top to bottom. Action Type: Discrete(4). Action 0: Move Left, 1: Move Down, 2: Move Right, 3: Move Up. Reward Reward is 0 for every step taken, 0 for falling in a hole, and 1 for reaching the final goal. Starting State The starting state is at the top left corner. Episode Termination The episode terminates on reaching the goal or falling into one of the holes.`'",tabular data,108-2-NTUT-DRL-HW1,inClass,NTUT Deep Reinforcement Learning Homework 1: Frozen Lake,ae,108-2-ntut-drl-hw1 35,"'`In this homework, we are going to train an agent that can drive in the CarRacing environment in OpenAI Gym. [link](https://gym.openai.com/envs/CarRacing-v0/) CarRacing is a top-down racing environment whose state consists of 96x96 pixels. Reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points. The episode finishes when all tiles are visited. Some indicators are shown at the bottom of the window together with the state RGB buffer. From left to right: true speed, four ABS sensors, steering wheel position, gyroscope. CarRacing-v0 defines ""solving"" as getting an average reward of 900 over 100 consecutive trials.`'",tabular data,NTUT DRL Homework 2 - Car Racing,,Training an agent that can race on the tracks,ae,ntut-drl-homework-2-car-racing 36,"'`This isn't your classic decoder ring puzzle found in a cereal box. There's a twist. Welcome to the Ciphertext Challenge! In this competition, we've encrypted parts of a well-known dataset -- the 20 Newsgroups dataset -- with several simple, classic ciphers. This dataset is commonly used as a multi-class and NLP sample set, noted for its small size, varied nature, and the first-hand look it offers into the deep existential horrors of the 90s-era internet. 
With 20 fairly distinct classes and lots of clues, it allows for a wide variety of successful approaches. We've made the problem a little harder to solve. Fabulous Kaggle swag will go to the top competitors - the highest-scoring teams (which might be the first to crack the code!), and the most popular kernel. Note that this is a short competition, so use your submissions wisely. Note: it is possible to apply a number of techniques using ONLY the ciphertext. Acknowledgements Kaggle is hosting this competition for the machine learning community to use for fun and practice. You can view and download the unencrypted dataset from Jason Rennie's homepage. In the words of the host: The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. If you use the dataset in a scientific publication, please reference (at a minimum) the above website. Photo by U.S. Air Force photo/Don Branum`'",text data,20 Newsgroups Ciphertext Challenge,,V8g{9827,MacroFScore,20-newsgroups-ciphertext-challenge 37,"'`Competition Description MNIST (""Modified National Institute of Standards and Technology"") is the de facto hello world dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike. In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. 
`'",image data,MNIST (uglier),inClass,Everything I touch turns ugly,categorizationaccuracy,mnist-(uglier) 38,'`--`',image data,2019-PR-Midterm-ImageClassification,inClass,Midterm #4,categorizationaccuracy,2019-pr-midterm-imageclassification 39,"'` 50% . https://github.com/sejongresearch/2019.Fall.PatternRecognition/issues/26`'",tabular data,2019-PR-Midterm-MusicClassification,inClass,Midterm #5,categorizationaccuracy,2019-pr-midterm-musicclassification 40,"'` competition , . uci repository, 1200 train set 1000, test set 200 . : http://mlr.cs.umass.edu/ml/machine-learning-databases/abalone/abalone.names : https://www.youtube.com/watch?v=x0WUEP-GOKk : pptx `'",tabular data,2020-abalone-age,inClass,predict abalone's age,rmse,2020-abalone-age 41,'`CIFAR 100 dataset predictions`',image data,2020 MPCS53111 HW5 CIFAR100,,2020 MPCS53111 HW5 CIFAR100,categorizationaccuracy,2020-mpcs53111-hw5-cifar100 42,'`Fashion MNIST dataset predictions`',image data,2020 MPCS53111 HW5 FashionMNIST,,2020 MPCS53111 HW5 FashionMNIST,categorizationaccuracy,2020-mpcs53111-hw5-fashionmnist 43,"'`2020 AI Cd, Cu, As, Hg, Pb, Zn, Ni .`'",tabular data,2020ai_soil,,2020년 AI 프로젝트 : 토양 오염의 수치를 예측해보자,rmse,2020ai_soil 44,"'`We often face the problem of searching for meaningful emails among thousands of promotional emails. This challenge focuses on creating a multi-class classifier that can classify an email into four classes based on the metadata extracted from the email. See the data description to learn more about the data.`'",tabular data,2EL1730: Machine Learning,,Classify an email into four classes based on the metadata extracted from the emails.,meanfscore,2el1730:-machine-learning 45,"'`Self-driving technology presents a rare opportunity to improve the quality of life in many of our communities. Avoidable collisions, single-occupant commuters, and vehicle emissions are choking cities, while infrastructure strains under rapid urban growth. 
Autonomous vehicles are expected to redefine transportation and unlock a myriad of societal, environmental, and economic benefits. You can apply your data analysis skills in this competition to advance the state of self-driving technology. Lyft, whose mission is to improve people's lives with the world's best transportation, is investing in the future of self-driving vehicles. Level 5, their self-driving division, is working on a fleet of autonomous vehicles, and currently has a team of 450+ across Palo Alto, London, and Munich working to build a leading self-driving system (they're hiring!). Their goal is to democratize access to self-driving technology for hundreds of millions of Lyft passengers. From a technical standpoint, however, the bar to unlock technical research and development on higher-level autonomy functions like perception, prediction, and planning is extremely high. This implies technical R&D on self-driving cars has traditionally been inaccessible to the broader research community. This dataset aims to democratize access to such data, and foster innovation in higher-level autonomy functions for everyone, everywhere. By conducting a competition, we hope to encourage the research community to focus on hard problems in this space, namely 3D object detection over semantic maps. In this competition, you will build and optimize algorithms based on a large-scale dataset. This dataset features the raw sensor camera inputs as perceived by a fleet of multiple, high-end, autonomous vehicles in a restricted geographic area. If successful, you'll make a significant contribution towards stimulating further development in autonomous vehicles and empowering communities around the world.`'",image data,Lyft 3D Object Detection for Autonomous Vehicles,,Can you advance the state of the art in 3D object detection? 
,Lyft3DObjectDetectionAP,lyft-3d-object-detection-for-autonomous-vehicles 46,"'`Introduction Welcome to Professor Hyoseok Hwang's Deep Learning class (2020) classification competition. Goal of Competition Through this competition, we hope you will improve your deep learning and PyTorch coding skills, so that you can finally have a wide perspective on deep learning models. Dataset Description We offer four classes of images: cheetahs, jaguars, hyenas, and tigers. All images are 400(H) * 400(W) * 3(RGB). For each class, 900 training images and 100 validation images are provided. Each image's label is specified in its file name and folder name, so you can use the image name to create the classification ground truth. In total, 4K labeled images are provided for the training dataset. There are also 404 test images, which are contaminated; their labels are not provided. We look forward to seeing good results. `'",image data,testtesttesttest,inClass,;l;l,categorizationaccuracy,testtesttesttest 47,"'`Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. The goal of this competition is to build a model that borrowers can use to help make the best financial decisions. 
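As background on how such a model is scored, the AUC metric used here is the probability that a randomly chosen positive case (a borrower who later experienced distress) receives a higher predicted score than a randomly chosen negative case. A minimal self-contained sketch, with made-up labels and scores:

```python
def auc(labels, scores):
    # Rank-based AUC: fraction of positive/negative pairs ranked correctly,
    # counting ties as half-correct. O(n^2), fine for illustration.
    pairs = wins = 0.0
    for yi, si in zip(labels, scores):
        for yj, sj in zip(labels, scores):
            if yi == 1 and yj == 0:
                pairs += 1
                if si > sj:
                    wins += 1
                elif si == sj:
                    wins += 0.5
    return wins / pairs

labels = [1, 0, 1, 0, 0]            # 1 = experienced financial distress
scores = [0.9, 0.2, 0.6, 0.4, 0.1]  # model-predicted probabilities
print(auc(labels, scores))  # 1.0: every positive outscores every negative
```

A model that ranks borrowers no better than chance scores about 0.5.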
Historical data are provided on 250,000 borrowers and the prize pool is $5,000 ($3,000 for first, $1,500 for second and $500 for third).`'",tabular data,Give Me Some Credit,featured,Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. ,AUC,give-me-some-credit 48,"'`Bored of MNIST? The goal of this competition is to provide a simple extension to the classic MNIST competition we're all familiar with. Instead of using Arabic numerals, it uses a recently-released dataset of Kannada digits. Kannada is a language spoken predominantly by people of Karnataka in southwestern India. The language has roughly 45 million native speakers and is written using the Kannada script. Wikipedia This competition uses the same format as the MNIST competition in terms of how the data is structured, but it's different in that it is a synchronous re-run Kernels competition. You write your code in a Kaggle Notebook, and when you submit the results, your code is scored on both the public test set, as well as a private (unseen) test set. Technical Information All details of the dataset curation have been captured in the paper titled: Prabhu, Vinay Uday. ""Kannada-MNIST: A new handwritten digits dataset for the Kannada language."" arXiv preprint arXiv:1908.01242 (2019) The GitHub repo of the author can be found here. On the originally-posted dataset, the author suggests some interesting questions you may be interested in exploring. Please note, although this dataset has been released in full, the purpose of this competition is for practice, not to find the labels to submit a perfect score. In addition to the main dataset, the author also disseminated an additional real-world handwritten dataset (with 10k images), termed the 'Dig-MNIST dataset', that can serve as an out-of-domain test dataset. 
It was created with the help of volunteers who were non-native users of the language, authored on a smaller sheet and scanned with different scanner settings compared to the main dataset. This 'Dig-MNIST' dataset serves as a more difficult test set (an accuracy of 76.1% was reported in the paper cited above), and achieving ~98+% accuracy on this test dataset would be rather commendable. Acknowledgments Kaggle thanks Vinay Prabhu for providing this interesting dataset for a Playground competition. Image reference: https://www.researchgate.net/figure/speech-for-Kannada-numbers_fig2_313113588`'",image data,Kannada MNIST,,MNIST-like dataset for Kannada handwritten digits,CategorizationAccuracy,kannada-mnist 49,"'`Welcome In this challenge you'll notice there isn't a leaderboard, and you are not required to develop a predictive model. This isn't a traditional supervised Kaggle machine learning competition. Instead, this challenge asks you to use data to propose specific rule modifications for the NFL that aim to reduce the occurrence of concussions during punt plays. For more information on this challenge format, see this forum thread. This challenge is part of NFL 1st & Future, presented by Arrow Electronics, the NFL's annual Super Bowl competition designed to spur innovation in player health, safety and performance. The Challenge For the 2018 season, the NFL revised their kickoff rules in an effort to reduce the risk of injury during those plays. By examining injury reports, player position and velocity data, and game video, they were able to understand the game-play circumstances that may exacerbate the risk of injury to players. This comprehensive review showed that over the course of all games during the 2015-2017 seasons, the kickoff represented only six percent of plays but 12 percent of concussions. Players had approximately four times the risk of concussion on returned kickoffs compared to running or passing plays. 
The changes to the kickoff rule aim to address the components that posed the most risk, like the use of a two-man wedge. Now, the NFL is challenging Kagglers to help them perform the same examination, this time on punt play rules. They have provided data for all punt plays from the 2016 and 2017 NFL seasons that includes player rosters, on-field position data and video data, including the plays in which a player suffered a concussion. Your challenge is to propose specific rule modifications (e.g. changes to the initial formation, tackling techniques, blocking rules, etc.), supported by data, that may reduce the occurrence of concussions during punt plays. More details on the entry criteria are available in the Overview tab > Evaluation. About The NFL The National Football League is America's most popular sports league, composed of 32 franchises that compete each year to win the Super Bowl, the world's biggest annual sporting event. Founded in 1920, the NFL developed the model for the successful modern sports league, including national and international distribution, extensive revenue sharing, competitive excellence, and strong franchises across the country. The NFL is committed to advancing progress in the diagnosis, prevention and treatment of sports-related injuries. The NFL's ongoing health and safety efforts include support for independent medical research and engineering advancements and a commitment to look at anything and everything to protect players and make the game safer, including enhancements to medical protocols and improvements to how our game is taught and played. As more is learned, the league evaluates and changes rules to evolve the game and try to improve protections for players. Since 2002 alone, the NFL has made 50 rules changes intended to eliminate potentially dangerous tactics and reduce the risk of injuries. For more information about the NFL's health and safety efforts, please visit www.PlaySmartPlaySafe.com. 
Evaluation`'",tabular data,NFL Punt Analytics Competition,,Analyze NFL game data and suggest rules to improve player safety during punt plays,football,nfl-punt-analytics-competition 50,"'`Help some of the world's leading astronomers grasp the deepest properties of the universe. The human eye has been the arbiter for the classification of astronomical sources in the night sky for hundreds of years. But a new facility -- the Large Synoptic Survey Telescope (LSST) -- is about to revolutionize the field, discovering 10 to 100 times more astronomical sources that vary in the night sky than we've ever known. Some of these sources will be completely unprecedented! The Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC) asks Kagglers to help prepare to classify the data from this new survey. Competitors will classify astronomical sources that vary with time into different classes, scaling from a small training set to a very large test set of the type the LSST will discover. More background information is available here. Acknowledgements PLAsTiCC is funded through LSST Corporation Grant Award # 2017-03 and administered by the University of Toronto. Financial support for LSST comes from the National Science Foundation (NSF) through Cooperative Agreement No. 1258333, the Department of Energy (DOE) Office of Science under Contract No. DE-AC02-76SF00515, and private funding raised by the LSST Corporation. The NSF-funded LSST Project Office for construction was established as an operating center under management of the Association of Universities for Research in Astronomy (AURA). The DOE-funded effort to build the LSST camera is managed by the SLAC National Accelerator Laboratory (SLAC). The National Science Foundation (NSF) is an independent federal agency created by Congress in 1950 to promote the progress of science. NSF supports basic research and people to create knowledge that transforms the future. Photo Credit: M. 
Park/Inigo Films/LSST/AURA/NSF`'",image data,PLAsTiCC Astronomical Classification,,Can you help make sense of the Universe?,WeightedMulticlassLoss,plasticc-astronomical-classification 51,"'`Can a computer learn complex, abstract tasks from just a few examples? Current machine learning techniques are data-hungry and brittle: they can only make sense of patterns they've seen before. Using current methods, an algorithm can gain new skills by exposure to large amounts of data, but cognitive abilities that could broadly generalize to many tasks remain elusive. This makes it very challenging to create systems that can handle the variability and unpredictability of the real world, such as domestic robots or self-driving cars. However, alternative approaches, like inductive programming, offer the potential for more human-like abstraction and reasoning. The Abstraction and Reasoning Corpus (ARC) provides a benchmark to measure AI skill-acquisition on unknown tasks, with the constraint that only a handful of demonstrations are shown to learn a complex task. It provides a glimpse of a future where AI could quickly learn to solve new problems on its own. The Kaggle Abstraction and Reasoning Challenge invites you to try your hand at bringing this future into the present! This competition is hosted by François Chollet, creator of the Keras neural networks library. Chollet's paper on measuring intelligence provides the context and motivation behind the ARC benchmark. In this competition, you'll create an AI that can solve reasoning tasks it has never seen before. Each ARC task contains 3-5 pairs of train inputs and outputs, and a test input for which you need to predict the corresponding output with the pattern learned from the train examples. 
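For concreteness, tasks in the public ARC release are JSON objects with 'train' (demonstration pairs) and 'test' (inputs to predict); the toy task and the copy-the-input baseline below are our own illustration, not a competitive solver:

```python
# Each ARC task is a JSON object with 'train' (demonstration input/output
# pairs) and 'test' (inputs whose outputs must be predicted). Grids are
# lists of lists of integers 0-9. This toy task is made up.
toy_task = {
    'train': [
        {'input': [[0, 1], [1, 0]], 'output': [[1, 0], [0, 1]]},
    ],
    'test': [
        {'input': [[1, 1], [0, 0]]},
    ],
}

def identity_baseline(task):
    # Naive baseline: predict each test output as an unchanged copy of
    # its input. Real solvers must infer the transformation from 'train'.
    return [[row[:] for row in pair['input']] for pair in task['test']]

print(identity_baseline(toy_task))  # [[[1, 1], [0, 0]]]
```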
If successful, you'll help bring computers closer to human cognition and you'll open the door to completely new AI applications!`'",tabular data,Abstraction and Reasoning Challenge,,Create an AI capable of solving reasoning tasks it has never seen before,MeanBestErrorAtK,abstraction-and-reasoning-challenge 52,"'`AI AcademyTime Series DataRobotDataRobot SubmissioncsvSubmit Predictions 2014-11-01 ~ 2014-12-31(storeid)(prodid)""Sales_qty"" Forum Rules 5/12() 23:59`'",time series,202003 AI Academy Time Series Assignment,inClass,Time series assignment for AI Academy students,mae,202003-ai-academy-time-series-assignment 53,"'`Two years of mobile behavior, 67 million clicks, 27 million searches, 8 million users, 1 million products There will also be a Visualization Contest that can be entered from either track. For more details on the event, go to: http://www.sfbayacm.org/DM-Hackathon-2012-10 Data Provided by: Cloud Compute Sponsors: `'",tabular data,Data Mining Hackathon on (20 mb) Best Buy mobile web site - ACM SF Bay Area Chapter,research,Getting Started - Predict which Xbox game a visitor will be most interested in based on their search query. (20 MB),MAP@k,data-mining-hackathon-on-(20-mb)-best-buy-mobile-web-site-acm-sf-bay-area-chapter 54,"'`Consumer brands often offer discounts to attract new shoppers to buy their products. The most valuable customers are those who return after this initial incented purchase. With enough purchase history, it is possible to predict which shoppers, when presented an offer, will buy a new item. However, identifying the shopper who will become a loyal buyer -- prior to the initial purchase -- is a more challenging task. The Acquire Valued Shoppers Challenge asks participants to predict which shoppers are most likely to repeat purchase. To aid with algorithmic development, we have provided complete, basket-level, pre-offer shopping history for a large set of shoppers who were targeted for an acquisition campaign. 
The incentive offered to that shopper and their post-incentive behavior are also provided. This challenge provides almost 350 million rows of completely anonymised transactional data from over 300,000 shoppers. It is one of the largest problems run on Kaggle to date.`'",tabular data,Acquire Valued Shoppers Challenge,,Predict which shoppers will become repeat buyers,AUC,acquire-valued-shoppers-challenge 55,"'`To assess the impact of climate change on Earth's flora and fauna, it is vital to quantify how human activities such as logging, mining, and agriculture are impacting our protected natural areas. Researchers in Mexico have created the VIGIA project, which aims to build a system for autonomous surveillance of protected areas. A first step in such an effort is the ability to recognize the vegetation inside the protected areas. In this competition, you are tasked with creating an algorithm that can identify a specific type of cactus in aerial imagery. This is a kernels-only competition, meaning you must submit predictions using Kaggle Kernels. Read the basics here. Acknowledgments Kaggle is hosting this competition for the machine learning community to use for fun and practice. The original version of this data can be found here, with details in the following paper: Efren López-Jiménez, Juan Irving Vasquez-Gomez, Miguel Angel Sanchez-Acevedo, Juan Carlos Herrera-Lozada, Abril Valeria Uriarte-Arcia, Columnar Cactus Recognition in Aerial Images using a Deep Learning Approach. Ecological Informatics. 2019. Acknowledgements to Consejo Nacional de Ciencia y Tecnología. Project cátedra 1507. Instituto Politécnico Nacional. Universidad de la Cañada. Contributors: Eduardo Armas García, Rafael Cano Martínez and Luis Cresencio Mota Carrera. J.I. Vasquez-Gomez, JC. Herrera Lozada. 
Abril Uriarte, Miguel Sanchez.`'",image data,Aerial Cactus Identification,,Determine whether an image contains a columnar cactus,AUC,aerial-cactus-identification 56,"'`Text documents are one of the richest sources of data for businesses. We'll use a public dataset from the BBC comprising 2225 articles, each labeled under one of 5 categories: business, entertainment, politics, sport or tech. The dataset is broken into 1490 records for training and 735 for testing. The goal will be to build a system that can accurately classify previously unseen news articles into the right category. The competition is evaluated using Accuracy as a metric. The following blog post has good information on how to look at the problem. https://cloud.google.com/blog/products/gcp/problem-solving-with-ml-automatic-document-classification`'",text data,AI Academy Intermediate Class Competition 1,inClass,News Article Categorization,categorizationaccuracy,ai-academy-intermediate-class-competition-1 57,'` `',,AIDefenseGame18011862,inClass,AIDefenseGame18011862,categorizationaccuracy,aidefensegame18011862 58,"'`Airbus is excited to challenge Kagglers to build a model that detects all ships in satellite images as quickly as possible. Can you find them even in imagery with clouds or haze? Here's the backstory: Shipping traffic is growing fast. More ships increase the chances of infractions at sea like environmentally devastating ship accidents, piracy, illegal fishing, drug trafficking, and illegal cargo movement. This has compelled many organizations, from environmental protection agencies to insurance companies and national government authorities, to have a closer watch over the open seas. Airbus offers comprehensive maritime monitoring services by building a meaningful solution for wide coverage, fine details, intensive monitoring, premium reactivity and interpretation response. 
Combining its proprietary data with highly trained analysts, they help the maritime industry increase knowledge, anticipate threats, trigger alerts, and improve efficiency at sea. A lot of work has been done over the last 10 years to automatically extract objects from satellite images, with significant results but no effective operational impact. Now Airbus is turning to Kagglers to increase the accuracy and speed of automatic ship detection. Algorithm Speed Prize: After the Kaggle challenge is complete, competitors may submit their model via a private Kaggle kernel for a speed evaluation based upon the inference time on over 40,000 image chips (the typical size of a full satellite image) to win a special algorithm speed prize. If you're interested in exploring more Airbus data, you are welcome to check out the OneAtlas Sandbox. And for more insights on our Maritime Surveillance capabilities, have a look at the Airbus Intelligence page.`'",image data,Airbus Ship Detection Challenge,,Find ships on satellite images as quickly as possible,IntersectionOverUnionObjectSegmentationBeta,airbus-ship-detection-challenge 59,"'`When you've been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect. Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. 
Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate's efforts to ensure a worry-free customer experience. New to Kaggle? This competition is a recruiting competition: your chance to get a foot in the door with the hiring team at Allstate.`'",tabular data,Allstate Claims Severity,,How severe is an insurance claim?,MAE,allstate-claims-severity 60,"'`As a customer shops for an insurance policy, he/she will receive a number of quotes with different coverage options before purchasing a plan. This is represented in this challenge as a series of rows that include a customer ID, information about the customer, information about the quoted policy, and the cost. Your task is to predict the purchased coverage options using a limited subset of the total interaction history. If the eventual purchase can be predicted sooner in the shopping window, the quoting process is shortened and the issuer is less likely to lose the customer's business. Using a customer's shopping history, can you predict what policy they will end up choosing?`'",tabular data,Allstate Purchase Prediction Challenge,,Predict a purchased policy based on transaction history,CategorizationAccuracy,allstate-purchase-prediction-challenge 61,"'`As the 2nd largest provider of carbohydrates in Africa, cassava is a key food security crop grown by small-holder farmers because it can withstand harsh conditions. At least 80% of small-holder farmer households in Sub-Saharan Africa grow cassava, and viral diseases are major sources of poor yields. In this competition, we introduce a dataset of 5 fine-grained cassava leaf disease categories with 9,436 labeled images collected during a regular survey in Uganda, mostly crowdsourced from farmers taking images of their gardens, and annotated by experts at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab at Makerere University, Kampala. 
The dataset consists of leaf images of the cassava plant, with 9,436 annotated images and 12,595 unlabeled images of cassava leaves. Participants can choose to use the unlabeled images as additional training data. The goal is to learn a model that classifies a given image into one of the 4 disease categories or a 5th category indicating a healthy leaf, using the images in the training data. This competition is part of the fine-grained visual-categorization workshop (FGVC6 workshop) at CVPR 2019. Acknowledgements We thank the different experts and collaborators from NaCRRI for assisting in preparing this dataset. We thank Ernest Mwebaze, Timnit Gebru, Andrea Frome, Solomon Nsumba, Jeremy Tusubira, and Chris Omongo for developing the original version of this challenge. Citation Please cite this paper if you use the dataset for your project: https://arxiv.org/pdf/1908.02900.pdf`'",image data,Cassava Disease Classification,inClass,Classify pictures of cassava leaves into 1 of 4 disease categories (or healthy),categorizationaccuracy,cassava-disease-classification 62,"'`This is a project from the course T81-855: Applications of Deep Learning at Washington University in St. Louis. All students must create a Kaggle account and submit a solution. If you are a student, once you have submitted your solution entry, log into Canvas (at WUSTL) and submit a single file telling me your Kaggle name on the leaderboard (you do not need to register to Kaggle with your real name). This competition will be visible to the public, so there may be non-student submissions as well as student ones. This competition asks whether you can predict a ""business score"" for each USA zipcode. The exact nature of the ""business score"" (score column) is not disclosed. It is connected to the business/market saturation of a zipcode. The data files given to you on Kaggle will not be sufficient for prediction. 
You must augment your data with additional files. Some suggestions include: US Government public data for ""SOI Tax Stats - Individual Income Tax Statistics"" Data.GOV: The home of the U.S. Government's open data This video describes the competition.`'",tabular data,"Applications of Deep Learning(WUSTL, Spring 2019)",inClass,Predict a score indicating business saturation of USA zipcodes,rmse,"applications-of-deep-learning(wustl,-spring-2019)" 63,"'`This Kaggle competition is a project from the course T81-855: Applications of Deep Learning at Washington University in St. Louis. All students must create a Kaggle account and submit a solution. If you are a student, once you have submitted your solution entry, log into Canvas (at WUSTL) and submit a single file telling me your Kaggle name on the leaderboard (you do not need to register to Kaggle with your real name). This competition will be visible to the public, so there may be non-student submissions as well as student ones. For this competition, you will determine if a person is wearing glasses or not. However, your test subjects are not real people. A Generative Adversarial Neural Network (GAN) created all of the people that you see in this competition. The GAN network creates these images using a 512-number latent vector. The Kaggle assignment provides both the latent vectors and the faces produced by those vectors. Both may be useful to you in classifying whether someone is wearing glasses or not. Glasses come in many different forms; make sure you can identify them all. The image below shows a sampling of some of the types of glasses contained in the dataset.`'",image data,"Applications of Deep Learning(WUSTL, Spring 2020)",,Computer vision: glasses or not?,logloss,"applications-of-deep-learning(wustl,-spring-2020)" 64,"'`This competition asks you to count paperclips on sheets of paper. Image counting applications are common for many sorts of data analysis. Images such as you see on Google Maps are frequent targets. 
Counting cars, people, trees, houses, and water coverage can all provide valuable information about the demographics of these regions. For this competition you are given images of paper covered by paperclips. Your task is to count the paperclips and return as accurate a count as possible. The images are 256x256. Some images are very easy, some are a little more difficult, and some will be really challenging.`'",image data,"Applications of Deep Learning(WUSTL, Spring 2020B)",,Computer vision: count the paperclips,rmse,"applications-of-deep-learning(wustl,-spring-2020b)" 65,"'`Imagine being able to detect blindness before it happens. Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-age adults. Aravind Eye Hospital in India hopes to detect and prevent this disease among people living in rural areas where medical screening is difficult to conduct. Successful entries in this competition will improve the hospital's ability to identify potential patients. Further, the solutions will be spread to other ophthalmologists through the 4th Asia Pacific Tele-Ophthalmology Society (APTOS) Symposium. Currently, Aravind technicians travel to these rural areas to capture images and then rely on highly trained doctors to review the images and provide a diagnosis. Their goal is to scale their efforts through technology; to gain the ability to automatically screen images for disease and provide information on how severe the condition may be. In this synchronous Kernels-only competition, you'll build a machine learning model to speed up disease detection. You'll work with thousands of images collected in rural areas to help identify diabetic retinopathy automatically. If successful, you will not only help to prevent lifelong blindness, but these models may be used to detect other sorts of diseases in the future, like glaucoma and macular degeneration. 
Get started today!`'",image data,APTOS 2019 Blindness Detection,,Detect diabetic retinopathy to stop blindness before it's too late ,QuadraticWeightedKappa,aptos-2019-blindness-detection 66,"'`The William and Flora Hewlett Foundation (Hewlett) is sponsoring the Automated Student Assessment Prize (ASAP). Hewlett is appealing to data scientists and machine learning specialists to help solve an important social problem. We need fast, effective and affordable solutions for automated grading of student-written essays. Hewlett is sponsoring the following prizes: $60,000: 1st place; $30,000: 2nd place; $10,000: 3rd place. You are provided access to hand-scored essays, so that you can build, train and test scoring engines against a wide field of competitors. Your success depends upon how closely your scores match those of human expert graders. While we believe that these financial incentives are important, we also intend to introduce top performers to leading vendors in the industry and/or an established base of interested buyers. Hewlett is opening the field of automated student assessment to you. We want to induce a breakthrough that is both personally satisfying and game-changing for improving public education. Today, state departments of education are developing new forms of testing and grading methods to assess the new common core standards. In this environment the need for more sophisticated and affordable options is vital. For example, we know that essays are an important expression of academic achievement, but they are expensive and time-consuming for states to grade by hand. So, we are frequently limited to multiple-choice standardized tests. We believe that automated scoring systems can yield fast, effective and affordable solutions that would allow states to introduce essays and other sophisticated testing tools. We believe that you can help us pave the way towards a breakthrough. 
ASAP is designed to achieve the following goals: Challenge developers of automated student assessment systems to demonstrate their current capabilities. Compare the efficacy and cost of automated scoring to that of human graders. Reveal product capabilities to state departments of education and other key decision makers interested in adopting them. The graded essays are selected according to specific data characteristics. On average, each essay is approximately 150 to 550 words in length. Some are more dependent upon source materials than others. This range of essay types is provided so that we can better understand the strengths of your solution. It is our intent to showcase quality and reliability, based on how well you can match expert human graders for each essay. You will be provided with training data for each essay prompt. The number of training essays does vary. For example, the lowest amount of training data is 1,190 essays, randomly selected from a total of 1,982. The data will contain ASCII formatted text for each essay followed by one or more human scores, and (where necessary) a final resolved human score. Where it is relevant, you are provided with more than one human score, so that you may evaluate the reliability of the human scorers, but - keep in mind - that you will be predicting the resolved score. Also, please note that most essays are scored using a holistic scoring rubric. However, one data set uses a trait scoring rubric. The variability is intended to test the limits of your scoring engine's capabilities. Following a period of 3 months to build and/or train your engine, you will be provided with test data that will contain new essays, randomly selected for blind evaluation. However, you will notice that the rater and resolved score columns will be blank. You will be asked to supply, based on your engine's predictions for each essay, your score in the resolved score column and then submit your new data set on this site. 
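Since agreement with the resolved human scores is judged with a quadratic weighted kappa metric, a minimal pure-Python sketch of that metric may be useful; the function name and the toy rating vectors below are illustrative, not part of the official scoring code.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    # Quadratic weighted kappa: 1 means perfect agreement, 0 means
    # chance-level agreement, negative means worse than chance.
    n = max_rating - min_rating + 1
    # Observed agreement matrix O[i][j]: counts of (score_a=i, score_b=j).
    O = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating][b - min_rating] += 1
    # Expected matrix from the outer product of the two score histograms.
    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]
    total = len(rater_a)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement weight
            num += w * O[i][j]
            den += w * hist_a[i] * hist_b[j] / total
    return 1.0 - num / den

print(quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3))  # 1.0 (perfect)
```

Because the weights grow quadratically with the score gap, predicting a 1 when the resolved score is a 4 is penalized far more heavily than predicting a 3.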
As part of the file that you will submit with your predictive scores, you will be asked to submit additional information. We would like to understand both the time and capital that you've spent developing your engine, the profile of your team (or you as an individual if you are working alone) and the projected cost to implement your solution on a larger scale, along with any known limitations. Basically, you will have the opportunity to present your case for who you are, why your model is commercially viable and to what extent you can use your model to satisfy the interests of potential buyers. This other information will not be used to determine any prize rewards, and it is optional. But, if you provide it, it will be used to evaluate whether or not your model should be presented to state departments of education and others who stand to benefit from your work. Also, please note that it is our intention to stage other follow-on ASAP phases in the months ahead. We are starting with graded essays and will follow with new data: Phase 1: Demonstration for long-form constructed response (essays); Phase 2: Demonstration for short-form constructed response (short answers); Phase 3: Demonstration for symbolic mathematical/logic reasoning (charts/graphs). In every instance, we seek to drive innovation for new solutions to automated student assessment. We hope that you will enjoy this process. May the best model win!`'",text data,The Hewlett Foundation: Automated Essay Scoring,,Develop an automated scoring algorithm for student-written essays.,WeightedMeanQuadraticWeightedKappa,the-hewlett-foundation:-automated-essay-scoring 67,"'`Q: How much does it cost to cool a skyscraper in the summer? A: A lot! And not just in dollars, but in environmental impact. Thankfully, significant investments are being made to improve building efficiencies to reduce costs and emissions. The question is, are the improvements working? That's where you come in. 
Under pay-for-performance financing, the building owner makes payments based on the difference between their real energy consumption and what they would have used without any retrofits. The latter values have to come from a model. Current methods of estimation are fragmented and do not scale well. Some assume a specific meter type or don't work with different building types. In this competition, you'll develop accurate models of metered building energy usage in the following areas: chilled water, electric, hot water, and steam meters. The data comes from over 1,000 buildings over a three-year timeframe. With better estimates of these energy-saving investments, large-scale investors and financial institutions will be more inclined to invest in this area to enable progress in building efficiencies. About the Host Founded in 1894, ASHRAE serves to advance the arts and sciences of heating, ventilation, air conditioning, refrigeration and their allied fields. ASHRAE members represent building system design and industrial process professionals around the world. With over 54,000 members serving in 132 countries, ASHRAE supports research, standards writing, publishing and continuing education - shaping tomorrow's built environment today. Banner photo by Federico Beccari on Unsplash`'",tabular data,ASHRAE - Great Energy Predictor III,,How much energy will a building consume?,RMSLE,ashrae-great-energy-predictor-iii 68,"'`In this competition you will analyze images of galaxies to determine the probability that each belongs to a particular class. Your model must have at least the following: At least two Convolutional Layers followed by normalization and pooling layers. Activation function: ReLU. Optimizer: Gradient Descent. At least one fully connected layer followed by a softmax transformation. There are several techniques to improve the performance and generalization of a Convolutional Neural Network. 
The objective is to try different methods to tweak the performance of your model and observe how the performance changes. You may use CNNs, ResNets, ODE-Nets, etc.`'",image data,Test 1,inClass,Galaxy Merger Detection: Classify the morphologies of distant galaxies in our Universe,rmse,test-1 69,"'`Mankind Vs Machine Sydney Prediction Challenge Introduction The Mankind vs Machine competition requires data on which either bespoke or automated predictions can be compared. I decided it would be great if this data and competition could tie into a broader social purpose, and introduce everybody to an awesome open dataset and a compelling problem. THE OPEN DATASET: The Australian Charities and Not for Profit Commission (ACNC) Annual Information Statement, containing activity, beneficiary, and financial information for all registered charities in Australia. Data has been combined over 2014, 2015, and 2016. THE PROBLEM: Optimisation of government grants to the non profit sector (which types of services need it most)? A POSSIBLE SOLUTION: Creating an estimate of a given charity's income from donations/bequests/etc. THE TARGET FEATURE: [donations_and_bequests], which is the charity's reported income from donations or bequests for a given fiscal year. THE EVALUATION METRIC: Root Mean Square Error on the log(1+x) transformed prediction, where x is the predicted target feature [donations_and_bequests]. THE PRIZE: GLORY and Swag from GA. More about this competition, and an intro to the AutoML service TPOT, can be found using these links. Over time, more and more information will be available in the competition Kernels. 
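The stated evaluation metric (RMSE on log(1+x)-transformed values, i.e. RMSLE) can be sketched in a few lines of Python; the function name and sample values are illustrative:

```python
import math

def rmse_log1p(y_true, y_pred):
    # Root Mean Square Error computed on log(1 + x)-transformed values,
    # matching the competition's stated metric (equivalent to RMSLE).
    sq = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

# The log transform means relative, not absolute, errors are penalized:
# being $100k off on a $1M donation figure costs about the same as
# being $100 off on a $1k figure.
print(rmse_log1p([1_000_000.0, 1_000.0], [900_000.0, 900.0]))
```

This is why charity incomes spanning several orders of magnitude can share one leaderboard score sensibly.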
Intro Talk: http://bit.ly/2SVkFpw Github repo with TPOT example: http://bit.ly/2qAwaWz Acknowledgements We thank the ACNC for making this dataset available!`'",tabular data,Mankind Vs Machine Sydney,inClass,Submit predictions and see if you can beat the machines,rmse,mankind-vs-machine-sydney 70,"'`In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding. For this competition, we have provided 11 days' worth of Avazu data to build and test prediction models. Can you find a strategy that beats standard classification algorithms? The winning models from this competition will be released under an open-source license.`'",tabular data,Click-Through Rate Prediction,,Predict whether a mobile ad will be clicked,LogLoss,click-through-rate-prediction 71,"'`When selling used goods online, a combination of tiny, nuanced details in a product description can make a big difference in drumming up interest. Even with an optimized product listing, demand for a product may simply not exist, frustrating sellers who may have over-invested in marketing. Avito, Russia's largest classified advertisements website, is deeply familiar with this problem. Sellers on their platform sometimes feel frustrated with both too little demand (indicating something is wrong with the product or the product listing) or too much demand (indicating a hot item with a good description was underpriced). In their fourth Kaggle competition, Avito is challenging you to predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted) and historical demand for similar ads in similar contexts. 
With this information, Avito can inform sellers on how to best optimize their listing and provide some indication of how much interest they should realistically expect to receive.`'",tabular data_image data,Avito Demand Prediction Challenge,,Predict demand for an online classified ad,RMSE,avito-demand-prediction-challenge 72,"'`Online marketplaces make it a breeze for users to both find and buy unique treasures or unload their dusty record collections in the spirit of spring cleaning. As one of the world's largest and fastest growing online classifieds, Avito hosts high volumes of listings and competitive sellers often go to great lengths to get their wares noticed. For some sellers, this means posting the same ad several times with slightly altered text or photos taken from different angles. To ensure that buyers can easily find what they're looking for without sifting through dozens of deceptively identical ads, Avito is asking Kagglers to develop a model that can automatically spot duplicate ads. With more accurate duplicate ad detection, Avito will make it much easier for buyers to find and make their next purchase with an honest seller.`'",tabular data_image data,Avito Duplicate Ads Detection,,Can you detect duplicitous duplicate ads?,AUC,avito-duplicate-ads-detection 73,"'`Welcome to this new challenge brought to us by BBVA! This time, we want our customers, beyond the recurring purchases they make with their credit and/or debit cards, to also have the opportunity to try shopping at new merchant establishments they never dared to enter. This challenge therefore seeks the best algorithm and/or method to help us identify the most recommendable establishments for each customer. 
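A per-customer establishment rating of the kind used in this challenge, each establishment's share of a customer's total card spend, can be sketched as follows; the transactions and column layout are invented for illustration, not the actual competition schema:

```python
from collections import defaultdict

# Toy transactions: (customer_id, establishment_id, amount).
# These rows are invented; the real data is at the transaction level
# over 12 months with additional customer-profiling variables.
transactions = [
    ("c1", "e1", 300.0),
    ("c1", "e2", 100.0),
    ("c2", "e1", 50.0),
]

spend = defaultdict(float)   # (customer, establishment) -> total amount there
total = defaultdict(float)   # customer -> total amount across establishments
for customer, estab, amount in transactions:
    spend[(customer, estab)] += amount
    total[customer] += amount

# ratingMonto_ij = MontoEstab_ij / MontoTotalCliente_i
rating = {key: amt / total[key[0]] for key, amt in spend.items()}
print(rating[("c1", "e1")])  # 0.75
```

Each customer's ratings over the establishments they visit sum to 1, so the value behaves like a preference weight rather than a raw spend figure.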
The attached data covers 30,000 customers and 74,339 establishments, at the transaction level over a 12-month period, together with other customer-profiling variables and a rating indicator, ratingMonto. The ratingMonto field relates the purchase amount to the establishment, where: \[ ratingMonto_{ij} = \frac{MontoEstab_{ij}}{MontoTotalCliente_{i}} \] i: the i-th customer; j: the j-th establishment. To generate your solutions you will have to put your knowledge into practice and develop the optimal algorithm for identifying the most recommendable establishments for each customer. Solutions for the 03dataBaseTestRec.csv file must be uploaded to this platform, which will determine your position in the ranking. The top 5 entries will advance to a second phase, to be held on October 12, where they will have to present their algorithms before a jury.`'",tabular data,Sistema Recomendador BBVA,inClass,Se requiere desarrollar el mejor algoritmo para identificar los establecimientos más recomendados para nuestros clientes.,rmse,sistema-recomendador-bbva 74,"'`The goal of this challenge is to recognize one user amongst all others. Keep in mind that the data is highly imbalanced. The evaluation metric for this challenge is ROC AUC.`'",tabular data,Classification 3 Challenge,inClass,Do your best!,auc,classification-3-challenge 75,"'`Challenge and dataset summary available at https://arxiv.org/abs/2010.00170 Bengali is the 5th most spoken language in the world, with hundreds of millions of speakers. It's the official language of Bangladesh and the second most spoken language in India. Considering its reach, there's significant business and educational interest in developing AI that can optically recognize images of handwritten Bengali. This challenge hopes to improve on approaches to Bengali recognition. Optical character recognition is particularly challenging for Bengali. 
While Bengali has 49 letters (to be more specific, 11 vowels and 38 consonants) in its alphabet, there are also 18 potential diacritics, or accents. This means that there are many more graphemes, or the smallest units in a written language. The added complexity results in ~13,000 different grapheme variations (compared to English's 250 graphemic units). Bangladesh-based non-profit Bengali.AI is focused on helping to solve this problem. They build and release crowdsourced, metadata-rich datasets and open source them through research competitions. Through this work, Bengali.AI hopes to democratize and accelerate research in Bengali language technologies and to promote machine learning education. For this competition, you're given the image of a handwritten Bengali grapheme and are challenged to separately classify three constituent elements in the image: grapheme root, vowel diacritics, and consonant diacritics. By participating in the competition, you'll hopefully accelerate Bengali handwritten optical character recognition research and help enable the digitalization of educational resources. Moreover, the methods introduced in the competition will also empower cousin languages in the Indian subcontinent. Acknowledgements: Apurba: Apurba is the exclusive sponsor of Bengali.AI for this competition. Apurba Technologies Inc. was founded by a group of technology veterans who have been working at the cutting edge of software development in Silicon Valley for many years. Apart from its many ventures, Apurba is a pioneer in Bengali NLP research today and is accelerating AI research in Bangladesh through its contributions. Intelligent Machines Limited: Intelligent Machines Limited is the technical partner of Bengali.AI for this competition and is providing compute support to Bangladeshi students. IML is an Artificial Intelligence and Advanced Analytics startup offering customized solutions to businesses in Bangladesh. 
IML believes in the strength of talented Bangladeshi people and in the possibility of a far greater and more developed Bangladesh in the coming days.`'",image data,Bengali.AI Handwritten Grapheme Classification,,Classify the components of handwritten Bengali,WeightedCategorizationAccuracy,bengali.ai-handwritten-grapheme-classification 76,"'`Introduction This competition is part of the Solar Flare Prediction from Time Series of Solar Magnetic Field Parameters challenge in the IEEE Big Data 2020 Big Data Cup. After the competition phase is completed, a link for the submission of the accompanying academic paper will be provided to the top 10 participants as ranked by the public/private leaderboard weighting described on the competition website. The academic papers will be ranked by peer reviewers and a final decision will be made using the weighting method detailed on the competition website. Important Notes We are interested in the ideas and their implementations, and not necessarily the highest score. Therefore, we strongly encourage contestants to keep their code in a (for now, private) repository and share it with us. There are other evaluation metrics (e.g., `TSS` and `HSS`) that are more appropriate for rare-event classification tasks such as ours. Since they are not available on Kaggle, for ranking the ideas we rely on the most relevant metric that works for our multi-class task. But in the academic paper, if you are interested in continuing in that direction, we certainly need other metrics to be included as well. We would be more than happy to provide more information on this topic. Organizers On behalf of the Data Mining Lab ( dmlab.cs.gsu.edu ), in the Department of Computer Science, Georgia State University, this challenge has been organized by the following. Dustin J. Kempton, Ph.D. (cs.gsu.edu/~dkempton1) Berkay Aydin, Ph.D. (cs.gsu.edu/~baydin/) Azim Ahmadzadeh, Ph.D. Candidate (grid.cs.gsu.edu/~aahmadzadeh1/) Rafal A. Angryk, Ph.D. 
(cs.gsu.edu/~rangryk)`'",time series,BigData Cup Challenge 2020: Flare Prediction,,Solar Flare Prediction from Time Series of Solar Magnetic Field Parameters,meanfscore,bigdata-cup-challenge-2020:-flare-prediction 77,"'`We've all been there: stuck at a traffic light, only to be given mere seconds to pass through an intersection, behind a parade of other commuters. Imagine if you could help city planners and governments anticipate traffic hot spots ahead of time and reduce the stop-and-go stress of millions of commuters like you. Geotab provides a wide variety of aggregate datasets gathered from commercial vehicle telematics devices. Harnessing the insights from this data has the power to improve safety, optimize operations, and identify opportunities to address infrastructure challenges. The dataset for this competition includes aggregate stopped vehicle information and intersection wait times. Your task is to predict congestion, based on an aggregate measure of stopping distance and waiting times, at intersections in 4 major US cities: Atlanta, Boston, Chicago & Philadelphia. This competition is being hosted in partnership with BigQuery, a data warehouse for manipulating, joining, and querying large scale tabular datasets. BigQuery also offers BigQuery ML, an easy way for users to create and run machine learning models to generate predictions through a SQL query interface. Kaggle recently released a BigQuery integration within our kernels notebook environment, and this starter kernel gives you a great starting point for how to use BQ & BQML. You're encouraged to use your data savvy, resourcefulness & intuition to find and join in additional external datasets that will increase your model's predictive power. Alright, stop waiting and get started! Acknowledgments A big thanks to Geotab for providing the dataset for this competition! 
Geotab is advancing security, connecting commercial vehicles to the internet and providing web-based analytics to help customers better manage their fleets. Geotab's open platform and Marketplace, offering hundreds of third-party solution options, allows both small and large businesses to automate operations by integrating vehicle data with their other data assets. As an IoT hub, the in-vehicle device provides additional functionality through IOX Add-Ons. Processing billions of data points a day, Geotab leverages data analytics and machine learning to help customers improve productivity, optimize fleets through the reduction of fuel consumption, enhance driver safety, and achieve strong compliance with regulatory changes. Geotab's products are represented and sold worldwide through Authorized Geotab Resellers. To learn more, please visit www.geotab.com and follow us @GEOTAB and on LinkedIn.`'",tabular data,BigQuery-Geotab Intersection Congestion,,Can you predict wait times at major city intersections?,RMSE,bigquery-geotab-intersection-congestion 78,"'`The objective of the competition is to help us build as good a model as possible so that we can, as optimally as this data allows, relate molecular information to an actual biological response. We have shared the data in the comma separated values (CSV) format. Each row in this data set represents a molecule. The first column contains experimental data describing an actual biological response; the molecule was seen to elicit this response (1), or not (0). The remaining columns represent molecular descriptors (d1 through d1776); these are calculated properties that can capture some of the characteristics of the molecule - for example size, shape, or elemental constitution. 
The descriptor matrix has been normalized.`'",tabular data,Predicting a Biological Response,inClass,Predict a biological response of molecules from their chemical properties,LogLoss,predicting-a-biological-response 79,"'`Do you hear the birds chirping outside your window? Over 10,000 bird species occur in the world, and they can be found in nearly every environment, from untouched rainforests to suburbs and even cities. Birds play an essential role in nature. They are high up in the food chain and integrate changes occurring at lower levels. As such, birds are excellent indicators of deteriorating habitat quality and environmental pollution. However, it is often easier to hear birds than see them. With proper sound detection and classification, researchers could automatically intuit factors about an area's quality of life based on a changing bird population. There are already many projects underway to extensively monitor birds by continuously recording natural soundscapes over long periods. However, as many living and nonliving things make noise, the analysis of these datasets is often done manually by domain experts. These analyses are painstakingly slow, and results are often incomplete. Data science may be able to assist, so researchers have turned to large crowdsourced databases of focal recordings of birds to train AI models. Unfortunately, there is a domain mismatch between the training data (short recordings of individual birds) and the soundscape recordings (long recordings with often multiple species calling at the same time) used in monitoring applications. This is one of the reasons why the performance of the currently used AI models has been subpar. To unlock the full potential of these extensive and information-rich sound archives, researchers need good machine listeners to reliably extract as much information as possible to aid data-driven conservation. 
The mission of the Cornell Lab of Ornithology's Center for Conservation Bioacoustics (CCB) is to collect and interpret sounds in nature. The CCB develops innovative conservation technologies to inspire and inform the conservation of wildlife and habitats globally. By partnering with the data science community, the CCB hopes to further its mission and improve the accuracy of soundscape analyses. In this competition, you will identify a wide variety of bird vocalizations in soundscape recordings. Due to the complexity of the recordings, they contain weak labels. There might be anthropogenic sounds (e.g., airplane overflights) or other bird and non-bird (e.g., chipmunk) calls in the background, with a particular labeled bird species in the foreground. Bring your new ideas to build effective detectors and classifiers for analyzing complex soundscape recordings! If successful, your work will help researchers better understand changes in habitat quality, levels of pollution, and the effectiveness of restoration efforts. Reliable machine listeners would also allow conservationists to deploy more recording units worldwide and would enable data-driven conservation at a scale not yet possible. The eventual conservation outcomes could greatly improve the quality of life for many living organisms, birds and human beings included.`'",audio,Cornell Birdcall Identification,,Build tools for bird population monitoring,MeanFScoreBeta,cornell-birdcall-identification 80,"'`There is a $10,000 prize pool for this competition, with prizes awarded to the top 3 places: 1st place: $6,500 2nd place: $2,500 3rd place: $1,000`'",tabular data,Blue Book for Bulldozers,featured,"Predict the auction sale price for a piece of heavy equipment to create a ""blue book"" for bulldozers.",RMSLE,blue-book-for-bulldozers 81,"'`As a global specialist in personal insurance, BNP Paribas Cardif serves 90 million clients in 36 countries across Europe, Asia and Latin America. 
In a world shaped by the emergence of new uses and lifestyles, everything is going faster and faster. When facing unexpected events, customers expect their insurer to support them as soon as possible. However, claims management may require different levels of checks before a claim can be approved and a payment can be made. With the new practices and behaviors generated by the digital economy, this process needs to be adapted, with the help of data science, to meet the new needs and expectations of customers. In this challenge, BNP Paribas Cardif is providing an anonymized database with two categories of claims. Kagglers are challenged to predict the category of a claim based on features available early in the process, helping BNP Paribas Cardif accelerate its claims process and therefore provide a better service to its customers.`'",tabular data,BNP Paribas Cardif Claims Management,,Can you accelerate BNP Paribas Cardif's claims management process?,LogLoss,bnp-paribas-cardif-claims-management 82,"'`This is the home page of the competition. You don't need a subtitle here. The competition sub-title will appear above. This is where you introduce the problem. You can upload images using the ""select files"" widget on the left in the competition wizard. Upload an image, refresh the page, copy its URL, then insert within the wizard's editor. If you are copy-pasting from another application, like Word or your browser, try to make sure the html formatting is clean. You can view a page's html using the button at the top right of the editor's toolbar. We thank Professor Plum, Ph.D. 
for providing this dataset.`'",text data,Book Reviews,inClass,book reviews,categorizationaccuracy,book-reviews 83,'``',tabular data,Boston Housing Dataset,,Predicting apartment prices based on district.,mae,boston-housing-dataset 84,"'`Task Description The task in this competition is to predict the housing prices on the test set (test.csv) using the methods we introduced in the Regression, Resampling and Regularization module of the course. Quiz The first part of the competition will be held in class on Wednesday, February 19 as a quiz. It must therefore be submitted by 6:45pm on that day. For the quiz, you are required to generate a valid prediction and submit it to Kaggle. You will be graded on how well you use the material we learned, on the quality of your code and, to a lesser extent, on the performance of your model. You will be randomly assigned to groups. You will also need to provide a peer review on the performance of your teammates. All team members are expected to contribute. Download the Jupyter Notebook 'template' under the Notebooks tab. In the first block of the Jupyter Notebook, type in the names of all your team members. There are also hints and instructions that you might find useful in the template. By the end of class, make sure that all your code blocks run on the Jupyter Notebook. Zip the entire folder, including the dataset, rename it to team_name.zip, and upload to Canvas. Only one member of the team needs to submit to Kaggle and to Canvas. Note that you only have 10 Kaggle submissions per day. (This is to avoid overfitting on the test set.) Use them wisely. HW3 The second part of the competition will be assigned as the second part of HW3. It will be due on Sunday, February 23 at 11:59pm. It is worth 8 points overall. Continue to work on the competition with your teammates. Your main job is to improve the performance of your model and, ideally, come up with the best score on Kaggle. 
But we will also scrutinize your code to understand what you did. So, it is not just about happening to get the best performance on the test set. Teams that win 1st to 3rd place on Kaggle will start with the full 8 points, 4th to 6th place will start with 7 points, and others will start from 6 points. Though, again, remember that further points could be deducted for the quality of the code. In addition to the Kaggle score associated with your team, zip your HW folder with the data, rename it to team_name.zip and submit to Canvas. Note that you again only have 10 submissions per day. So, again, use them wisely. The final result will depend on your scores on the private leaderboard.`'",tabular data,ChapmanCS530 Data Mining Hackathon: Housing Prices,inClass,Can you predict the price of a house?,rmse,chapmancs530-data-mining-hackathon:-housing-prices 85,"'`IMPORTANT ANNOUNCEMENT (2 MAY 2018) We'd like to get in contact with you! Dear competitors, We would like to get in touch with everyone taking part in the Capitec BBLB 2018 Data Science Competition! Please send an email from your university email address to JeanneDaniel@capitecbank.co.za so we can send you information regarding the Competition Presentation and Award Ceremony. Please also add your Kaggle Username. The Presentation and Award Ceremony will be taking place on 17 May 2018, 1PM - 5PM. Again, this is for everyone that is taking part in the competition. Looking forward to hearing from you! Competition Description Here at Capitec we want our clients to be in charge of their financial lives, so they can live better while banking better. With the fast-changing South African economic circumstances, budgeting becomes an important aspect of financial health. One way to ease the economic stress of our clients is to help them budget better by showing them where their money is going. We at Capitec believe that this shouldn't be a worry for our clients, and we invite you to help us create a solution! 
The goal of this competition is to identify predetermined transaction categories based on the transactional data of our clients. Any student of the University of Stellenbosch, University of Western Cape or Cape Town University may enter. Important Note: Remember to create your Kaggle account with your university email, else you won't be eligible for the prizes. Competition prizes We will choose the top 5 candidates from the private leaderboard entries. Each candidate will present their solution in order to determine the winner. The top candidates will also be required to provide their code. The presentation date will be finalised in the week following the end of the competition. The leaderboard score will count for 65% of the final mark and the presentation for 35%. Prizes: 1st R6 000 2nd R4 000 3rd R3 000 4th and 5th R1 500 each Students are encouraged to make use of the Discussion forums if they have any questions regarding the problem. Competition duration: 6 April - 6 May 2018`'",tabular data,Capitec BBLB,inClass,Using Data Science to improve lives,multiclassloss,capitec-bblb 86,"'`CareerCon 2019 is upon us! CareerCon is a digital event all about landing your first data science job and registration is now open! Ahead of the event, we have a fun competition to get you started. See below for a unique challenge and opportunity to share your resume with select CareerCon sponsors. ___________________________________ The Competition Robots are smart by design. To fully understand and properly navigate a task, however, they need input about their environment. In this competition, you'll help robots recognize the floor surface they're standing on using data collected from Inertial Measurement Units (IMU sensors). We've collected IMU sensor data while driving a small mobile robot over different floor surfaces on the university premises. 
The task is to predict which one of the nine floor types (e.g., carpet, tiles, concrete) the robot is on using sensor data such as acceleration and velocity. Succeed and you'll help improve the navigation of robots without assistance across many different surfaces, so they won't fall down on the job. Special thanks for making this competition possible: The data for this competition has been collected by Heikki Huttunen and Francesco Lomio from the Department of Signal Processing and Damoon Mohamadi, Kaan Celikbilek, Pedram Ghazi and Reza Ghabcheloo from the Department of Automation and Mechanical Engineering, both from Tampere University, Finland. We at Kaggle would like to thank them all for kindly donating the data that has made this competition possible!`'",tabular data,CareerCon 2019 - Help Navigate Robots,,Compete to get your resume in front of our sponsors,categorizationaccuracy,careercon-2019-help-navigate-robots 87,"'`As with any big purchase, full information and transparency are key. While almost everyone describes buying a used car as frustrating, it's just as annoying to sell one, especially online. Shoppers want to know everything about the car but they must rely on often blurry pictures and little information, keeping used car sales a largely inefficient, local industry. Carvana, a successful online used car startup, has seen an opportunity to build long term trust with consumers and streamline the online buying process. An interesting part of their innovation is a custom rotating photo studio that automatically captures and processes 16 standard images of each vehicle in their inventory. While Carvana takes high quality photos, bright reflections and cars with similar colors as the background cause automation errors, which require a skilled photo editor to fix. In this competition, you're challenged to develop an algorithm that automatically removes the photo studio background. This will allow Carvana to superimpose cars on a variety of backgrounds. 
You'll be analyzing a dataset of photos, covering different vehicles with a wide variety of year, make, and model combinations.`'",image data,Carvana Image Masking Challenge,,Automatically identify the boundaries of the car in an image,Dice,carvana-image-masking-challenge 88,"'`As the second-largest provider of carbohydrates in Africa, cassava is a key food security crop grown by smallholder farmers because it can withstand harsh conditions. At least 80% of household farms in Sub-Saharan Africa grow this starchy root, but viral diseases are major sources of poor yields. With the help of data science, it may be possible to identify common diseases so they can be treated. Existing methods of disease detection require farmers to solicit the help of government-funded agricultural experts to visually inspect and diagnose the plants. This approach is labor-intensive, in short supply, and costly. As an added challenge, effective solutions for farmers must perform well under significant constraints, since African farmers may only have access to mobile-quality cameras with low bandwidth. In this competition, we introduce a dataset of 21,367 labeled images collected during a regular survey in Uganda. Most images were crowdsourced from farmers taking photos of their gardens, and annotated by experts at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab at Makerere University, Kampala. This is in a format that most realistically represents what farmers would need to diagnose in real life. Your task is to classify each cassava image into four disease categories or a fifth category indicating a healthy leaf. With your help, farmers may be able to quickly identify diseased plants, potentially saving their crops before they inflict irreparable damage. Recommended Tutorial We highly recommend Jesse Mostipak's Getting Started Tutorial that walks you through making your very first submission step by step. 
Acknowledgements The Makerere Artificial Intelligence (AI) Lab is an AI and Data Science research group based at Makerere University in Uganda. The lab specializes in the application of artificial intelligence and data science - including, for example, methods from machine learning, computer vision and predictive analytics - to problems in the developing world. Their mission is: ""To advance Artificial Intelligence research to solve real-world challenges."" We thank the different experts and collaborators from the National Crops Resources Research Institute (NaCRRI) for assisting in preparing this dataset. This is a Code Competition. Refer to Code Requirements for details.`'",image data,Cassava Leaf Disease Classification,,Identify the type of disease present on a Cassava Leaf image,CategorizationAccuracy,cassava-leaf-disease-classification 89,"'`Is there a cat in your dat? A common task in machine learning pipelines is encoding categorical variables for a given algorithm in a format that allows as much useful signal as possible to be captured. Because this is such a common task and important skill to master, we've put together a dataset that contains only categorical features, and includes: binary features low- and high-cardinality nominal features low- and high-cardinality ordinal features (potentially) cyclical features This Playground competition will give you the opportunity to try different encoding schemes for different algorithms to compare how they perform. We encourage you to share what you find with the community. If you're not sure how to get started, you can check out the Categorical Variables section of Kaggle's Intermediate Machine Learning course. 
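The feature families listed above each map naturally onto a standard transform. A minimal sketch in pandas/NumPy of one option per family (the column names and category values here are invented for illustration, not taken from the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the competition's feature families (names are made up).
df = pd.DataFrame({
    "bin_0": [0, 1, 1, 0],                     # binary: already numeric
    "nom_0": ["red", "green", "blue", "red"],  # nominal: no inherent order
    "ord_0": ["cold", "warm", "hot", "warm"],  # ordinal: ordered categories
    "month": [1, 4, 7, 12],                    # cyclical: 12 wraps around to 1
})

# Nominal: one-hot encoding, since the categories have no meaningful order.
onehot = pd.get_dummies(df["nom_0"], prefix="nom_0")

# Ordinal: map categories to integers that preserve their order.
ord_enc = df["ord_0"].map({"cold": 0, "warm": 1, "hot": 2})

# Cyclical: sin/cos projection, so month 12 lands next to month 1.
month_sin = np.sin(2 * np.pi * df["month"] / 12)
month_cos = np.cos(2 * np.pi * df["month"] / 12)
```

High-cardinality nominal features usually need something more compact than one-hot (e.g. target or frequency encoding), which is exactly the kind of trade-off this competition is designed to let you explore.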
Have Fun!`'",tabular data,Categorical Feature Encoding Challenge,,"Binary classification, with every feature a categorical",AUC,categorical-feature-encoding-challenge 90,"'`CDP is a global non-profit that drives companies and governments to reduce their greenhouse gas emissions, safeguard water resources, and protect forests. Each year, CDP takes the information supplied in its annual reporting process and scores companies and cities based on their journey through disclosure and towards environmental leadership. CDP houses the world's largest, most comprehensive dataset on environmental action. As the data grows to include thousands more companies and cities each year, there is increasing potential for the data to be utilized in impactful ways. Because of this potential, CDP is excited to launch an analytics challenge for the Kaggle community. Data scientists will scour environmental information provided to CDP by disclosing companies and cities, searching for solutions to our most pressing problems related to climate change, water security, deforestation, and social inequity. How do you help cities adapt to a rapidly changing climate amidst a global pandemic, but do it in a way that is socially equitable? What are the projects that can be invested in that will help pull cities out of a recession, mitigate climate issues, but not perpetuate racial/social inequities? What are the practical and actionable points where city and corporate ambition join, i.e. where do cities have problems that corporations affected by those problems could solve, and vice versa? How can we measure the intersection between environmental risks and social equity, as a contributor to resiliency? PROBLEM STATEMENT Develop a methodology for calculating key performance indicators (KPIs) that relate to the environmental and social issues that are discussed in the CDP survey data. Leverage external data sources and thoroughly discuss the intersection between environmental issues and social issues. 
Mine the data to generate automated insights demonstrating whether city and corporate ambitions take these factors into account. HOW TO PARTICIPATE To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will only review the most recent entry. A starter notebook demonstrates how to load and work with the data. To be valid, a submission must be contained in one or more notebooks, and made public on or before the submission deadline. Participants are free to use any datasets in addition to the official Kaggle dataset, but those datasets must be public and hosted on Kaggle for the submission to be valid.`'",tabular data,CDP - Unlocking Climate Solutions,,City-Business Collaboration for a Sustainable Future,pollution,cdp-unlocking-climate-solutions 91,"'`Think you can use your data science smarts to make big predictions at a molecular level? This challenge aims to predict interactions between atoms. Imaging technologies like MRI enable us to see and understand the molecular composition of tissues. Nuclear Magnetic Resonance (NMR) is a closely related technology which uses the same principles to understand the structure and dynamics of proteins and molecules. Researchers around the world conduct NMR experiments to further understanding of the structure and dynamics of molecules, across areas like environmental science, pharmaceutical science, and materials science. This competition is hosted by members of the CHemistry and Mathematics in Phase Space (CHAMPS) project at the University of Bristol, Cardiff University, Imperial College and the University of Leeds. Winning teams will have an opportunity to partner with this multi-university research program on an academic publication. Your Challenge In this competition, you will develop an algorithm that can predict the magnetic interaction between two atoms in a molecule (i.e., the scalar coupling constant). 
Once the competition finishes, CHAMPS would like to invite the top teams to present their work, discuss the details of their models, and work with them to write a joint research publication which discusses an open-source implementation of the solution. About Scalar Coupling Using NMR to gain insight into a molecule's structure and dynamics depends on the ability to accurately predict so-called scalar couplings. These are effectively the magnetic interactions between a pair of atoms. The strength of this magnetic interaction depends on intervening electrons and chemical bonds that make up a molecule's three-dimensional structure. Using state-of-the-art methods from quantum mechanics, it is possible to accurately calculate scalar coupling constants given only a 3D molecular structure as input. However, these quantum mechanics calculations are extremely expensive (days or weeks per molecule), and therefore have limited applicability in day-to-day workflows. A fast and reliable method to predict these interactions will allow medicinal chemists to gain structural insights faster and cheaper, enabling scientists to understand how the 3D chemical structure of a molecule affects its properties and behavior. Ultimately, such tools will enable researchers to make progress in a range of important problems, like designing molecules to carry out specific cellular tasks, or designing better drug molecules to fight disease. Join the CHAMPS Scalar Coupling challenge to apply predictive analytics to chemistry and chemical biology.`'",tabular data,Predicting Molecular Properties,,Can you measure the magnetic interactions between a pair of atoms?,GroupMeanLogMAE,predicting-molecular-properties 92,"'`One of the major problems that telecom operators face is customer retention. 
As a result, most telecom operators want to know which customers are most likely to leave them, so that they can immediately take action - such as offering a discount or a customised plan - to retain those customers.`'",tabular data,Telecom Churn Analytics,,Predict if a customer stays or leaves,categorizationaccuracy,telecom-churn-analytics 93,"'`Ciphertext Challenge II: The Challengening! It's baaaaaaack! In our first ciphertext competition, we hunted the wilds of the '90s-era internet. This time around, we're exploring the dark slow-broadband-y wastelands of 2011, with the Movie Review Dataset. In 2011 most of the internet hadn't even been invented yet*, so wow, you're in for a treat. Again, simple classic ciphers have been used to encrypt this dataset. Your mission this time: to correctly match each piece of ciphertext with its corresponding piece of plaintext. Daunting! Also, there are some new ciphers in play this time, which will involve some meta-puzzling. Enjoy! Swag prizes go to the first three teams to crack all four ciphers OR to the top three teams on the LB (in case the ciphers are not all cracked). Additionally, swag prizes will be awarded to the best competition-related kernels, in both visualization and cryptanalysis, based on upvotes. Go ahead. Get cracking! * - This is not true. Acknowledgements Maas, A., Daly, R., Pham, P., Huang, D., Ng, A. and Potts, C. (2011). Learning Word Vectors for Sentiment Analysis: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. [online] Portland, Oregon, USA: Association for Computational Linguistics, pp. 142-150. Available here.`'",text data,Ciphertext Challenge II,,553398 418126 467884 411 374106 551004 356535 539549 487091 290502 121468 556912 469347 515719 201909 101,CategorizationAccuracy,ciphertext-challenge-ii 94,"'`Ciphertext Challenge III: Wherefore Art Thou, Simple Ciphers? 
We've done the 2010s and the 1990s; now it's time for the '80s. The 1580s!! In this new decryption competition's dataset, we've gone from perfectly respectable sources of electronic horror to a time before computers; heck, before calculus was called ""calculus""! Shakespeare's plays are encrypted, and we time travelers must un-encrypt them so people can do innovative stage productions with intricate makeup, costumes, and possibly (possibly!) Leonardo DiCaprio. Think about it, folks: Leo.* As in previous ciphertext challenges, simple classic ciphers have been used to encrypt this dataset, along with a slightly less simple surprise that expands our definition of ""classic"" into the modern age. The mission is the same: to correctly match each piece of ciphertext with its corresponding piece of plaintext. Daunting! Meta-puzzles and difficulty await! Swag prizes go to the first three teams to crack all four ciphers OR to the top three teams on the leaderboard (in case the ciphers are not all cracked). Additionally, swag prizes will be awarded to the best competition-related kernels, in both visualization and cryptanalysis, based on upvotes. Last, the coveted ""Phil Prize"", for the team that correctly deduces the form AND key of the final cipher, is up for grabs again. Go ahead. Get cracking! * - Leo! Acknowledgements Many thanks to Kaggler LiamLarson for their excellent Shakespeare dataset.`'",text data,Ciphertext Challenge III,,BRBTvl0LNstxQLyxulCEEq1czSFje0Z6iajczo6ktGmitTE=,CategorizationAccuracy,ciphertext-challenge-iii 95,"'`csvId""label Id1,2,3..43submission_1.csv.`'",tabular data,class-test-data-mining,inClass,In-class data mining competition for 2016-cohort Information Management students at China University of Geosciences,categorizationaccuracy,class-test-data-mining 96,'` !`',tabular data,Classification : Animal Classification,,Classification with Animal data,categorizationaccuracy,classification-:-animal-classification 97,"'`Fake news is a common phenomenon on the internet nowadays. It affects every facet of our lives. It influences us in different ways. 
For example, fake news about supposed cures for COVID-19 can spread misinformation. To tackle this challenge, you are provided with a dataset to train your model and predict whether future news is real or fake. Acknowledgements Ahmed H, Traore I, Saad S. (2017) Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques. In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).`'",text data,Classifying the Fake News,,"Data Science and Big Data Analytics Course - UMT, Sialkot",meanfscore,classifying-the-fake-news 98,"'`The goal of this competition is to predict bike share use, given the hour, day, and information about the weather. Companies like Divvy try to predict how much demand there will be for bikes on any given day to allocate resources to redistribute bikes so that, ideally, very few bike stations are ever full (when you can't park your bike) or empty (when you can't pick up a bike if you want to). The data tab provides detailed information on the data set and necessary downloads. See the class Google Doc page for details about teams, grading, and updated tips.`'",tabular data,Prediction Competition - Bike Sharing Demand,inClass,"End of COMP 180 regression competition: You are to predict bike share use based on information such as day, time, and weather",rmse,prediction-competition-bike-sharing-demand 99,"'`This challenge serves as the final project for the ""How to win a data science competition"" Coursera course. In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. We are asking you to predict total sales for every product and store in the next month. 
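Because the target is a monthly total while the underlying data is daily, a typical first step is rolling the daily records up to one row per product, store, and month. A minimal sketch (the column names `shop_id`, `item_id`, and `item_cnt_day` are assumptions about the data layout for illustration, not guaranteed by this description):

```python
import pandas as pd

# Toy daily sales records (layout assumed for illustration).
daily = pd.DataFrame({
    "shop_id": [0, 0, 0, 1],
    "item_id": [10, 10, 11, 10],
    "month": ["2015-10", "2015-10", "2015-10", "2015-10"],
    "item_cnt_day": [1.0, 2.0, 1.0, 3.0],
})

# One row per (shop, item, month) -- the unit you are asked to predict.
monthly = (
    daily.groupby(["shop_id", "item_id", "month"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
```

From there, lag features built on earlier months of the same aggregate are a common way to turn the problem into supervised learning.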
By solving this competition you will be able to apply and enhance your data science skills.`'",time series,Predict Future Sales,,"Final project for ""How to win a data science competition"" Coursera course",rmse,predict-future-sales 100,"'`We're excited to announce a beta version of a brand-new type of ML competition called Simulations. In Simulation Competitions, you'll compete against a set of rules, rather than against an evaluation metric. To enter, accept the rules and create a Python submission file that can play against a computer, or another user. The Challenge In this game, your objective is to get a certain number of your checkers in a row horizontally, vertically, or diagonally on the game board before your opponent. When it's your turn, you drop one of your checkers into one of the columns at the top of the board. Then, let your opponent take their turn. This means each move may be trying to either win for you, or trying to stop your opponent from winning. The default number is four-in-a-row, but we'll have other options coming soon. Background History For the past 10 years, our competitions have been mostly focused on supervised machine learning. The field has grown, and we want to continue to provide the data science community cutting-edge opportunities to challenge themselves and grow their skills. So, what's next? Reinforcement learning is clearly a crucial piece in the next wave of data science learning. We hope that Simulation Competitions will provide the opportunity for Kagglers to practice and hone this burgeoning skill. How is this Competition Different? Instead of submitting a CSV file, or a Kaggle Notebook, you will submit a Python .py file (more submission options are in development). You'll also notice that the leaderboard is not based on how accurate your model is but rather how well you've performed against other users. See Evaluation for more details. We'd Love Your Feedback This competition is a low-stakes, trial-run introduction. 
We're considering this a beta launch: there are complicated new mechanics in play and we're still working on refining the process. We'd love your help testing the experience and want to hear your feedback. Please note that we may make changes throughout the competition that could include things like resetting the leaderboard, invalidating episodes, making changes to the interface, or changing the environment configuration (e.g. modifying the number of columns, rows, or tokens in a row required to win, etc).`'",tabular data,Connect X,,Connect your checkers in a row before your opponent!,custom metric,connect-x 101,"'`""when you have eliminated the impossible, whatever remains, however improbable, must be the truth"" -Sir Arthur Conan Doyle Our brains process the meaning of a sentence like this rather quickly. We're able to surmise: Some things to be true: ""You can find the right answer through the process of elimination."" Others that may have truth: ""Ideas that are improbable are not impossible!"" And some claims are clearly contradictory: ""Things that you have ruled out as impossible are where the truth lies."" Natural language processing (NLP) has grown increasingly elaborate over the past few years. Machine learning models tackle question answering, text extraction, sentence generation, and many other complex tasks. But, can machines determine the relationships between sentences, or is that still left to humans? If NLP can be applied between sentences, this could have profound implications for fact-checking, identifying fake news, analyzing text, and much more. The Challenge: If you have two sentences, there are three ways they could be related: one could entail the other, one could contradict the other, or they could be unrelated. Natural Language Inferencing (NLI) is a popular NLP problem that involves determining how pairs of sentences (consisting of a premise and a hypothesis) are related. 
Your task is to create an NLI model that assigns labels of 0, 1, or 2 (corresponding to entailment, neutral, and contradiction) to pairs of premises and hypotheses. To make things more interesting, the train and test set include text in fifteen different languages! You can find more details on the dataset by reviewing the Data page. Today, the most common approaches to NLI problems include using embeddings and transformers like BERT. In this competition, we're providing a starter notebook to try your hand at this problem using the power of Tensor Processing Units (TPUs). TPUs are powerful hardware accelerators specialized in deep learning tasks, including Natural Language Processing. Kaggle provides all users TPU quota at no cost, which you can use to explore this competition. Check out our TPU documentation and Kaggle's YouTube playlist for more information and resources. Recommended Tutorial We highly recommend Ana Sofia Uzsoy's tutorial that walks you through creating your very first submission step by step with TPUs and BERT. This is a great opportunity to flex your NLP muscles and solve an exciting problem! Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.`'",text data,"Contradictory, My Dear Watson",,Detecting contradiction and entailment in multilingual text using TPUs,categorizationaccuracy,"contradictory,-my-dear-watson" 102,"'`This is a relaunch of a previous competition, Conway's Reverse Game of Life, with the following changes: The grid size is larger (25 x 25 vs. the original 20 x 20) and the grid wraps around from top to bottom and left to right Submissions are solved forward by the appropriate number of steps, so that any correct starting solution will achieve a maximum score. This article contains the stepping function that is used for this competition. Obligatory Disclaimer: A lot has changed since the original competition was launched 6 years ago. 
With the change from ""exact starting point"" to ""any correct starting point"", it is possible to get a perfect score. We just don't know how difficult that will be. Use it as a fun learning experience, and don't spoil it for others by posting perfect solutions! ~~~~~~~~~ The Game of Life is a cellular automaton created by mathematician John Conway in 1970. The game consists of a board of cells that are either on or off. One creates an initial configuration of these on/off states and observes how it evolves. There are four simple rules to determine the next state of the game board, given the current state: Overpopulation: if a living cell is surrounded by more than three living cells, it dies. Stasis: if a living cell is surrounded by two or three living cells, it survives. Underpopulation: if a living cell is surrounded by fewer than two living cells, it dies. Reproduction: if a dead cell is surrounded by exactly three cells, it becomes a live cell. These simple rules result in many interesting behaviors and have been the focus of a large body of mathematics. As Wikipedia states Ever since its publication, Conway's Game of Life has attracted much interest, because of the surprising ways in which the patterns can evolve. Life provides an example of emergence and self-organization. It is interesting for computer scientists, physicists, biologists, biochemists, economists, mathematicians, philosophers, generative scientists and others to observe the way that complex patterns can emerge from the implementation of very simple rules. The game can also serve as a didactic analogy, used to convey the somewhat counter-intuitive notion that ""design"" and ""organization"" can spontaneously emerge in the absence of a designer. 
For example, philosopher and cognitive scientist Daniel Dennett has used the analogue of Conway's Life ""universe"" extensively to illustrate the possible evolution of complex philosophical constructs, such as consciousness and free will, from the relatively simple set of deterministic physical laws governing our own universe. The emergence of order from simple rules begs an interesting question: what happens if we set time backwards? This competition is an experiment to see if machine learning (or optimization, or any method) can predict the game of life in reverse. Is the chaotic start of Life predictable from its orderly ends? We have created many games, evolved them, and provided only the end boards. You are asked to predict the starting board that resulted in each end board. This is a Code Competition. Refer to Code Requirements for details.`'",tabular data,Conway's Reverse Game of Life 2020,,Reverse the arrow of time in the Game of Life,PostProcessorKernel,conways-reverse-game-of-life-2020 103,"'`The Inter-American Development Bank is asking the Kaggle community for help with income qualification for some of the world's poorest families. Are you up for the challenge? Here's the backstory: Many social programs have a hard time making sure the right people are given enough aid. It's especially tricky when a program focuses on the poorest segment of the population. The world's poorest typically can't provide the necessary income and expense records to prove that they qualify. In Latin America, one popular method uses an algorithm to verify income qualification. It's called the Proxy Means Test (or PMT). With PMT, agencies use a model that considers a family's observable household attributes, like the material of their walls and ceiling or the assets found in the home, to classify them and predict their level of need. While this is an improvement, accuracy remains a problem as the region's population grows and poverty declines. 
To improve on PMT, the IDB (the largest source of development financing for Latin America and the Caribbean) has turned to the Kaggle community. They believe that new methods beyond traditional econometrics, based on a dataset of Costa Rican household characteristics, might help improve PMT's performance. Beyond Costa Rica, many countries face this same problem of inaccurately assessing social need. If Kagglers can generate an improvement, the new algorithm could be implemented in other countries around the world. This is a Kernels-Only Competition, so you must submit your code through Kernels, rather than uploading .csv predictions. You can create private Kernels and even share/edit your work with teammates by adding them as collaborators.`'",tabular data,Costa Rican Household Poverty Level Prediction,,Can you identify which households have the highest need for social welfare assistance?,MacroFScore,costa-rican-household-poverty-level-prediction 104,"'`Recruit Ponpare is Japan's leading joint coupon site, offering huge discounts on everything from hot yoga, to gourmet sushi, to a summer concert bonanza. Ponpare's coupons open doors for customers they've only dreamed of stepping through. They can learn difficult-to-acquire skills, go on unheard-of adventures, and dine like (and with) the stars. Investing in a new experience is not cheap. We fear wasting our time and money on a product or service that we may not enjoy or fully understand. Ponpare takes the high price out of this equation, making it easier for you to take the leap towards your first sky-dive or diamond engagement ring. Using past purchase and browsing behavior, this competition asks you to predict which coupons a customer will buy in a given period of time. 
The resulting models will be used to improve Ponpare's recommendation system, so they can make sure their customers don't miss out on their next favorite thing.`'",tabular data,Coupon Purchase Prediction,,Predict which coupons a customer will buy,MAP@{K},coupon-purchase-prediction 105,"'`This week 1 forecasting task is now closed for submissions. Click here to visit the week 2 version, and make a submission there. This is one of the two complementary forecasting tasks to predict COVID-19 spread. This task is based on various regions across the world. To start on a single state-level subcomponent, please see the companion forecasting task for California, USA. Background The White House Office of Science and Technology Policy (OSTP) pulled together a coalition of research groups and companies (including Kaggle) to prepare the COVID-19 Open Research Dataset (CORD-19) to attempt to address key open scientific questions on COVID-19. Those questions are drawn from the National Academies of Sciences, Engineering, and Medicine (NASEM) and the World Health Organization (WHO). The Challenge Kaggle is launching two companion COVID-19 forecasting challenges to help answer a subset of the NASEM/WHO questions. While the challenge involves forecasting confirmed cases and fatalities between March 25 and April 22 by region, the primary goal isn't to produce accurate forecasts. It's to identify factors that appear to impact the transmission rate of COVID-19. You are encouraged to pull in, curate and share data sources that might be helpful. If you find variables that look like they impact the transmission rate, please share your finding in a notebook. As the data becomes available, we will update the leaderboard with live results based on data made available from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). We have received support and guidance from health and policy organizations in launching these challenges. 
We're hopeful the Kaggle community can make valuable contributions to developing a better understanding of factors that impact the transmission of COVID-19. Companies and Organizations There is also a call to action for companies and other organizations: If you have datasets that might be useful, please upload them to Kaggle's dataset platform and reference them in this forum thread. That will make them accessible to those participating in this challenge and a resource to the wider scientific community. Acknowledgements JHU CSSE for making the data available to the public. The White House OSTP for pulling together the key open questions. The image comes from the Centers for Disease Control and Prevention. This is a Code Competition. Refer to Code Requirements for details.`'",tabular data,COVID19 Global Forecasting (Week 1),,Forecast daily COVID-19 spread in regions around world,MCRMSLE,covid19-global-forecasting-(week-1) 106,"'`This week 3 forecasting task is now closed for submissions. Click here to visit the week 4 version, and make a submission there. This is week 3 of Kaggle's COVID19 forecasting series, following the Week 2 competition. This is the 3rd of at least 4 competitions we plan to launch in this series. All of the prior discussion forums have been migrated to this competition for continuity. Background The White House Office of Science and Technology Policy (OSTP) pulled together a coalition of research groups and companies (including Kaggle) to prepare the COVID-19 Open Research Dataset (CORD-19) to attempt to address key open scientific questions on COVID-19. Those questions are drawn from the National Academies of Sciences, Engineering, and Medicine (NASEM) and the World Health Organization (WHO). The Challenge Kaggle is launching a companion COVID-19 forecasting challenge to help answer a subset of the NASEM/WHO questions. 
While the challenge involves forecasting confirmed cases and fatalities between April 1 and April 30 by region, the primary goal isn't only to produce accurate forecasts. It's also to identify factors that appear to impact the transmission rate of COVID-19. You are encouraged to pull in, curate and share data sources that might be helpful. If you find variables that look like they impact the transmission rate, please share your finding in a notebook. As the data becomes available, we will update the leaderboard with live results based on data made available from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). We have received support and guidance from health and policy organizations in launching these challenges. We're hopeful the Kaggle community can make valuable contributions to developing a better understanding of factors that impact the transmission of COVID-19. Companies and Organizations There is also a call to action for companies and other organizations: If you have datasets that might be useful, please upload them to Kaggle's dataset platform and reference them in this forum thread. That will make them accessible to those participating in this challenge and a resource to the wider scientific community. Acknowledgements JHU CSSE for making the data available to the public. The White House OSTP for pulling together the key open questions. The image comes from the Centers for Disease Control and Prevention. This is a Code Competition. Refer to Code Requirements for details.`'",tabular data,COVID19 Global Forecasting (Week 3),,Forecast daily COVID-19 spread in regions around world,MCRMSLE,covid19-global-forecasting-(week-3) 107,"'`This is week 5 of Kaggle's COVID-19 forecasting series, following the Week 4 competition. This competition has some changes from prior weeks - be sure to check the Evaluation and Data pages for more details. All of the prior discussion forums have been migrated to this competition for continuity. 
Background The White House Office of Science and Technology Policy (OSTP) pulled together a coalition of research groups and companies (including Kaggle) to prepare the COVID-19 Open Research Dataset (CORD-19) to attempt to address key open scientific questions on COVID-19. Those questions are drawn from the National Academies of Sciences, Engineering, and Medicine (NASEM) and the World Health Organization (WHO). The Challenge Kaggle is launching a companion COVID-19 forecasting challenge to help answer a subset of the NASEM/WHO questions. While the challenge involves developing quantile estimates for confirmed cases and fatalities between May 12 and June 7 by region, the primary goal isn't only to produce accurate forecasts. It's also to identify factors that appear to impact the transmission rate of COVID-19. You are encouraged to pull in, curate and share data sources that might be helpful. If you find variables that look like they impact the transmission rate, please share your finding in a notebook. As the data becomes available, we will update the leaderboard with live results based on data made available from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). We have received support and guidance from health and policy organizations in launching these challenges. We're hopeful the Kaggle community can make valuable contributions to developing a better understanding of factors that impact the transmission of COVID-19. Companies and Organizations There is also a call to action for companies and other organizations: If you have datasets that might be useful, please upload them to Kaggle's dataset platform and reference them in this forum thread. That will make them accessible to those participating in this challenge and a resource to the wider scientific community. Acknowledgements JHU CSSE for making the data available to the public. The White House OSTP for pulling together the key open questions. 
The image comes from the Centers for Disease Control and Prevention. This is a Code Competition. Refer to Code Requirements for details.`'",tabular data,COVID19 Global Forecasting (Week 5),,Forecast daily COVID-19 spread in regions around world,WeightedPinballLoss,covid19-global-forecasting-(week-5) 108,"'`This is one of the two complementary forecasting tasks to predict COVID-19 spread. This one is based on a single state-level subcomponent in California, USA. Our intent in having this region-specific version is to offer a more manageable starting point for the global forecasting task. To start on the global version, please see the companion forecasting task. Background The White House Office of Science and Technology Policy (OSTP) pulled together a coalition of research groups and companies (including Kaggle) to prepare the COVID-19 Open Research Dataset (CORD-19) to attempt to address key open scientific questions on COVID-19. Those questions are drawn from the National Academies of Sciences, Engineering, and Medicine (NASEM) and the World Health Organization (WHO). The Challenge Kaggle is launching two companion COVID-19 forecasting challenges to help answer a subset of the NASEM/WHO questions. While the challenge involves forecasting confirmed cases and fatalities between March 25 and April 22 in California, the primary goal isn't to produce accurate forecasts. It's to identify factors that appear to impact the transmission rate of COVID-19. You are encouraged to pull in, curate and share data sources that might be helpful. If you find variables that look like they impact the transmission rate, please share your finding in a notebook. As the data becomes available, we will update the leaderboard with live results based on data made available from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). We have received support and guidance from health and policy organizations in launching these challenges. 
We're hopeful the Kaggle community can make valuable contributions to developing a better understanding of factors that impact the transmission of COVID-19. Companies and Organizations There is also a call to action for companies and other organizations: If you have datasets that might be useful, please upload them to Kaggle's dataset platform and reference them in this forum thread. That will make them accessible to those participating in this challenge and a resource to the wider scientific community. Acknowledgements JHU CSSE for making the data available to the public. The White House OSTP for pulling together the key open questions. The image comes from the Centers for Disease Control and Prevention. This is a Code Competition. Refer to Code Requirements for details.`'",tabular data,COVID19 Local US-CA Forecasting (Week 1),,"Forecast daily COVID-19 spread in California, USA",MCRMSLE,covid19-local-us-ca-forecasting-(week-1) 109,"'`Display advertising is a billion dollar effort and one of the central uses of machine learning on the Internet. However, its data and methods are usually kept under lock and key. In this research competition, CriteoLabs is sharing a week's worth of data for you to develop models predicting ad click-through rate (CTR). Given a user and the page he is visiting, what is the probability that he will click on a given ad? The goal of this challenge is to benchmark the most accurate ML algorithms for CTR estimation. All winning models will be released under an open source license. As a participant, you are given a chance to access the traffic logs from Criteo that include various undisclosed features along with the click labels. `'",tabular data,Display Advertising Challenge,research,Predict click-through rates on display ads,LogLoss,display-advertising-challenge 110,"'`So many of our favorite daily activities are mediated by proprietary search algorithms. 
Whether you're trying to find a stream of that reality TV show on cat herding or shopping an eCommerce site for a new set of Japanese sushi knives, the relevance of search results is often responsible for your (un)happiness. Currently, small online businesses have no good way of evaluating the performance of their search algorithms, making it difficult for them to provide an exceptional customer experience. The goal of this competition is to create an open-source model that can be used to measure the relevance of search results. In doing so, you'll be helping enable small business owners to match the experience provided by more resource-rich competitors. It will also provide more established businesses a model to test against. Given the queries and resulting product descriptions from leading eCommerce sites, this competition asks you to evaluate the accuracy of their search algorithms. Make a first submission with this Python benchmark on Kaggle scripts. The dataset for this competition was created using query-result pairings enriched on the CrowdFlower platform. They are sponsoring this competition as an investment in the open-source data science community. A dataset collected, cleaned, and labeled by CrowdFlower can make your supervised machine learning dreams come true.`'",tabular data,Crowdflower Search Results Relevance,,Predict the relevance of search results from eCommerce sites,QuadraticWeightedKappa,crowdflower-search-results-relevance 111,"'`Understanding how and why we are here is one of the fundamental questions for the human race. Part of the answer to this question lies in the origins of galaxies, such as our own Milky Way. Yet questions remain about how the Milky Way (or any of the other ~100 billion galaxies in our Universe) was formed and has evolved. Galaxies come in all shapes, sizes and colors: from beautiful spirals to huge ellipticals. 
Understanding the distribution, location and types of galaxies as a function of shape, size, and color is a critical piece for solving this puzzle. The Whirlpool Galaxy (M51). Credit: NASA and European Space Agency With each passing day telescopes around and above the Earth capture more and more images of distant galaxies. As better and bigger telescopes continue to collect these images, the datasets begin to explode in size. In order to better understand how the different shapes (or morphologies) of galaxies relate to the physics that create them, such images need to be sorted and classified. Kaggle has teamed up with Galaxy Zoo and Winton Capital to produce the Galaxy Challenge, where participants will help classify galaxies into categories. Image Credit: ESA/Hubble & NASA Galaxies in this set have already been classified once through the help of hundreds of thousands of volunteers, who collectively classified the shapes of these images by eye in a successful citizen science crowdsourcing project. However, this approach becomes less feasible as data sets grow to contain hundreds of millions (or even billions) of galaxies. That's where you come in. This competition asks you to analyze the JPG images of galaxies to find automated metrics that reproduce the probability distributions derived from human classifications. For each galaxy, determine the probability that it belongs in a particular class. Can you write an algorithm that behaves as well as the crowd does? Contributors: D. Harvey, C. Lintott, T. Kitching, P. Marshall, K. Willett, Galaxy Zoo Acknowledgments The Contributors and the rest of the Galaxy Zoo and Kaggle teams would like to say a big thank you to Winton Capital for helping make this happen. 
Without their support, we would not have been able to make this competition go ahead.`'",image data,Galaxy Zoo - The Galaxy Challenge,research,Classify the morphologies of distant galaxies in our Universe,RMSE,galaxy-zoo-the-galaxy-challenge 112,"'`The objective of this task is to predict keypoint positions on face images. This can be used as a building block in several applications, such as: tracking faces in images and video; analysing facial expressions; detecting dysmorphic facial signs for medical diagnosis; biometrics / face recognition. Detecting facial keypoints is a very challenging problem. Facial features vary greatly from one individual to another, and even for a single individual, there is a large amount of variation due to 3D pose, size, position, viewing angle, and illumination conditions. Computer vision research has come a long way in addressing these difficulties, but there remain many opportunities for improvement. This getting-started competition provides a benchmark data set and an R tutorial to get you going on analysing face images. Get started with R >> Acknowledgements The data set for this competition was graciously provided by Dr. Yoshua Bengio of the University of Montreal. James Petterson.`'",image data,Facial Keypoints Detection,gettingStarted,Detect the location of keypoints on face images,RMSE,facial-keypoints-detection 113,"'`Get started on this competition with Kaggle Scripts. No data download or local environment needed! Random forests? Cover trees? Not so fast, computer nerds. We're talking about the real thing. In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. 
The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type. This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes than of forest management practices. Acknowledgements Kaggle is hosting this competition for the machine learning community to use for fun and practice. This dataset was provided by Jock A. Blackard and Colorado State University. We also thank the UCI machine learning repository for hosting the dataset. If you use the problem in publication, please cite: Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science`'",tabular data,Forest Cover Type Prediction,playground,Use cartographic variables to classify forest categories,CategorizationAccuracy,forest-cover-type-prediction 114,"'`In their first recruiting competition, Telstra is challenging Kagglers to predict the severity of service disruptions on their network. Using a dataset of features from their service logs, you're tasked with predicting if a disruption is a momentary glitch or a total interruption of connectivity. Telstra is on a journey to enhance the customer experience - ensuring everyone in the company is putting customers first. In terms of its expansive network, this means continuously advancing how it predicts the scope and timing of service disruptions. Telstra wants to see how you would help it drive customer advocacy by developing a more advanced predictive model for service disruptions and to help it better serve its customers. This challenge was crafted as a simulation of the type of problem you might tackle as a member of the team at Telstra. 
Kagglers who stand out will be considered for data science roles in Telstra's Big Data team at Telstra's absolute discretion. Highly-ranked participants will combine technical expertise and intuition in data science problems with a keen business sense and an effortless ability to work with technical and non-technical staff to turn data into real changes that impact customers. Highly-ranked participants will be considered by Telstra for interviews for employment, based on their work in the Competition and ability to meet the selection criteria for any suitable open job vacancy in Melbourne and Sydney, Australia. Participation in this Competition is not a recruitment process and Kaggle does not provide Telstra with recruitment services.`'",tabular data,Telstra Network Disruptions,recruitment,Predict service faults on Australia's largest telecommunications network ,MulticlassLoss,telstra-network-disruptions 115,"'`In 2013, we hosted one of our favorite for-fun competitions: Dogs vs. Cats. Much has since changed in the machine learning landscape, particularly in deep learning and image analysis. Back then, a tensor flow was the diffusion of the creamer in a bored mathematician's cup of coffee. Now, even the cucumber farmers are neural netting their way to a bounty. Much has changed at Kaggle as well. Our online coding environment Kernels didn't exist in 2013, and so it was that we approached sharing by scratching primitive glyphs on cave walls with sticks and sharp objects. No more. Now, Kernels have taken over as the way to share code on Kaggle. IPython is out and Jupyter Notebook is in. We even have TensorFlow. What more could a data scientist ask for? But seriously, what more? Pull requests welcome. We are excited to bring back the infamous Dogs vs. Cats classification problem as a playground competition with kernels enabled. 
Although modern techniques may make light of this once-difficult problem, it is through practice of new techniques on old datasets that we will make light of machine learning's future challenges.`'",image data,Dogs vs. Cats Redux: Kernels Edition,playground,Distinguish images of dogs from cats,LogLoss,dogs-vs.-cats-redux:-kernels-edition 116,"'`Finding the perfect place to call your new home should be more than browsing through endless listings. RentHop makes apartment search smarter by using data to sort rental listings by quality. But while looking for the perfect apartment is difficult enough, structuring and making sense of all available real estate data programmatically is even harder. Two Sigma and RentHop, a portfolio company of Two Sigma Ventures, invite Kagglers to unleash their creative engines to uncover business value in this unique recruiting competition. Two Sigma invites you to apply your talents in this recruiting competition featuring rental listing data from RentHop. Kagglers will predict the number of inquiries a new listing receives based on the listing's creation date and other features. Doing so will help RentHop better handle fraud control, identify potential listing quality issues, and allow owners and agents to better understand renters' needs and preferences. Two Sigma has been at the forefront of applying technology and data science to financial forecasts. While their pioneering advances in big data, AI, and machine learning in the financial world have been pushing the industry forward, as with all other scientific progress, they are driven to make continual progress. This challenge is an opportunity for competitors to gain a sneak peek into Two Sigma's data science work outside of finance. 
Acknowledgments This competition is co-hosted by Two Sigma and RentHop (a portfolio company of Two Sigma Ventures, which is a division of Two Sigma Investments) to encourage creativity in using real world data to solve everyday problems.`'",tabular data,Two Sigma Connect: Rental Listing Inquiries,recruitment,How much interest will a new rental listing on RentHop receive?,MulticlassLoss,two-sigma-connect:-rental-listing-inquiries 117,'``',tabular data,Test Competition Please Ignore,inClass,Time flies like an arrow. Fruit flies like banana.,categorizationaccuracy,test-competition-please-ignore 118,"'`We are conducting the very first Fieldguide Challenge, an FGVCx challenge focusing on fine-grained classification of Lepidoptera (moths & butterflies). There are an estimated 175,000 species of Lepidoptera (leps), most of which are rarely photographed. Species identification is further complicated by distinct adult and immature stages (caterpillars), and often a lack of concrete visual diagnostics for separation between near-identical species. The goal of the competition is to push state-of-the-art classification of real world data that contains high class imbalance and high intraclass variability. Moths and butterflies are by far the most photographed group of animals outside of birds. Having an effective CV component to identification apps would engage tens of thousands of avid citizen scientists. It would allow us to create effective monitoring programs through promotion of night-lighting for moths, which are sensitive environmental indicators. We estimate that at least 50,000 species of moths and butterflies can be identified by images, and we have citizen science groups throughout the world ready to put their smartphone cameras to use. The Fieldguide Challenge dataset contains over 5000 species, with a combined training and testing set of over 530,000 images that have been collected and verified by multiple authors of Fieldguide. 
The dataset features visually-similar species, many of which are highly polymorphic, various stages in the species lifecycle (larvae vs adult), as well as species that are often misidentified as a moth/butterfly (e.g. caddisflies). The winning team may have the opportunity to present their results at the FGVC workshop at CVPR in Long Beach. The host would like to work with winning teams to build better models for Fieldguide.ai. Competition Objectives Improve classification accuracy over a dataset that contains high class imbalance and high heterogeneity (caterpillar vs. adult, wild vs pinned specimens, etc.) Improve error detection accuracy - what is not-a-lep* *Not-a-lep is a single, mixed-bag class containing images commonly mistaken as leps. For example: leafhoppers, caddisflies, data labels, and flowers often included in the background of moth/butterfly photos, etc.`'",image data,Fieldguide Challenge: Moths & Butterflies,,Improve classification accuracy over a dataset that contains high class imbalance and high heterogeneity,meanbesterroratk,fieldguide-challenge:-moths-&-butterflies 119,"'`Mini Machine Learning Competition This is a private ML competition for students of the Data Mining and Probabilistic Reasoning class at the University of Tübingen, WS18/19. Data This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. 
Goal The goal is to create an ML model that predicts whether a client will default on their credit card payment in the next month.`'",tabular data,[DM&PR WS18/19] Machine learning competition,,"A private ML competition for DM&PR students, Uni Tübingen",auc,[dm&pr-ws18/19]-machine-learning-competition 120,"'`Pima Indians Diabetes Database Number of Instances: 768 Number of Attributes: 8 plus class Columns Description: Number of times pregnant Plasma glucose concentration at 2 hours in an oral glucose tolerance test Diastolic blood pressure (mm Hg) Triceps skin fold thickness (mm) 2-Hour serum insulin (mu U/ml) Body mass index (weight in kg/(height in m)^2) Diabetes pedigree function Age (years) Class variable (0 or 1) Class Distribution: (class value 1 is interpreted as ""tested positive for diabetes"") 0 --- 500 1 ---- 268`'",tabular data,Diabetes Classification,inClass,Predict whether the given person is suffering from diabetes.,categorizationaccuracy,diabetes-classification 121,"'`This is your first evaluative lab. You have to predict the Average Price!`'",tabular data,Regression Evaluative Lab,inClass,This is your first evaluative lab and it is a regression problem!,mse,regression-evaluative-lab 122,"'`As a credit company, it is important to know beforehand who is able to pay their loans and who is not. 
The goal of this puzzle is to build a statistical/machine learning model to figure out which clients are able to honor their debt.`'",tabular data,FIA ML T5,inClass,Machine Learning,auc,fia-ml-t5 123,"'`75.06/95.58 Organización de Datos Second Semester of 2018 Practical Assignment 2: Statement The second practical assignment is a Machine Learning competition in which each group must try to determine, for each presented user, the probability that the user will make a conversion on Trocafone within a given period.`'",tabular data,Predicting user conversions,,Estimate the probability that a user completes a purchase.,auc,predicting-user-conversions 124,"'`Data-Driven Business Analytics First Use Case In this challenge you are tasked with forecasting the price of used cars.`'",tabular data,Data-Driven Business Analytics,inClass,WS 19/20 3rd Project,r2score,data-driven-business-analytics 125,"'`After their gaming startup PlayDoom couldn't garner funding in the Entrepreneurship Summit, the team of Bhadage, Bansal and Pandey decided to give up on their startup and go on a world trip to follow their passion of numismatics (the collection of coins). They travel to various countries and collect coins of various denominations to add to their collection. Now, they've hired a data scientist, Madhup, to create a tool to predict countries from their coin currencies, by making use of their marvellous coin collection. Can you help Madhup with the tool? Note - If you are applying a CNN, be ready with Keras from-scratch code as well as transfer learning code. The data has been updated, so kindly re-download the data.`'",image data,Gallivanters,inClass,Help a data scientist create a tool to predict countries from their coin currencies,categorizationaccuracy,gallivanters 126,"'`Practice of Data Analysis - For New Generation & Intern This is the first challenge for Interns & New Generation. 
Data analysis is an important skill for everyone. In particular, AI, Machine Learning and Deep Learning are powerful techniques for data analysis. You will learn more and more while participating in this competition. If you have any problems, you can write something in Discussion. Also, we would be glad to see you share your code & ideas in Kernels.`'",tabular data,Python Class - Practice,inClass,Data Analysis - Practice For New Generation & Intern,auc,python-class-practice 127,"'`Goal: The goal of this project is to apply everything we've covered so far in a new context: classification of heart disease. We've covered many of the central components of machine learning and several algorithms. Now you apply them. Specifically, you will use a set of 14 features to predict a single outcome: whether or not an individual patient has heart disease. This competition will help introduce you to some of the differences (and the similarities) between classification problems and regression problems.`'",tabular data,EC524: Heart-disease classification,inClass,Classify individuals' risk for heart disease,meanfscore,ec524:-heart-disease-classification 128,"'`Your task is to classify news articles into three categories: news, clickbait and other. UPD 21.04: added unlabeled data`'",text data,DL in NLP Spring 2019. Classification,inClass,Train a clickbait detector,meanfscore,dl-in-nlp-spring-2019.-classification 129,"'`This competition is hosted by the data analysis club of IIT Palakkad. The challenge is to predict the price of each house given some information related to houses. The goal of the competition is to familiarize you with the Kaggle environment and the basics of regression. Go through the data page and evaluation page for further details. For submitting predictions click Submit Predictions. 
The maximum number of submissions per day is 5.`'",tabular data,House Price Prediction,inClass,predict the price of a house,rmse,house-price-prediction 130,"'`Human brain research is among the most complex areas of study for scientists. We know that age and other factors can affect its function and structure, but more research is needed into what specifically occurs within the brain. With much of the research using MRI scans, data scientists are well positioned to support future insights. In particular, neuroimaging specialists look for measurable markers of behavior, health, or disorder to help identify relevant brain regions and their contribution to typical or symptomatic effects. In this competition, you will predict multiple assessments plus age from multimodal brain MRI features. You will be working from existing results from other data scientists, doing the important work of validating the utility of multimodal features in a normative population of unaffected subjects. Due to the complexity of the brain and differences between scanners, generalized approaches will be essential to effectively propel multimodal neuroimaging research forward. The Tri-Institutional Georgia State University/Georgia Institute of Technology/Emory University Center for Translational Research in Neuroimaging and Data Science (TReNDS) leverages advanced brain imaging to promote research into brain health. The organization is focused on developing, applying and sharing advanced analytic approaches and neuroinformatics tools. Among its software projects are the GIFT and FIT neuroimaging toolboxes, the COINS data management system, and the COINSTAC toolkit for federated learning, all aimed at supporting data scientists and other neuroimaging researchers. Making the leap from research to clinical application is particularly difficult in brain health. In order to translate to clinical settings, research findings have to be reproduced consistently and validated in out-of-sample instances. 
The problem is particularly well-suited for data science, but current approaches typically do not generalize well. With this large dataset and competition, your efforts could directly address an important area of brain research. Acknowledgments`'",image data,TReNDS Neuroimaging,research,"Multiscanner normative age and assessments prediction with brain function, structure, and connectivity",WMAE,trends-neuroimaging 131,"'`This is the home page of the competition. As the description of the competition suggests, you are given a dataset with 199 attributes (including ID and Class) and 13000 instances, and your task is to classify the given dataset into 5 classes [1,2,3,4,5] using only Clustering algorithms.`'",tabular data,DM-Assignment 1,inClass,Classify by Clustering!,categorizationaccuracy,dm-assignment-1 132,'`Predict whether a flight will be delayed by more than 15 minutes.`',tabular data,Focus start 2020,inClass,Predicting flight delays of more than 15 minutes,auc,focus-start-2020 133,"'`Classification of Russian road traffic signs from the RTSD dataset: 48x48 images, 66 sign classes.`'",image data,Traffic signs classification,inClass,Russian road traffic signs classification,categorizationaccuracy,traffic-signs-classification 134,"'`Introduction Machine Learning is used across many spheres around the world. The healthcare industry is no exception. Machine Learning can play an essential role in predicting the presence/absence of Locomotor disorders, Heart diseases, and more. Such information, if predicted well in advance, can provide important insights to doctors who can then adapt their diagnosis and treatment on a per-patient basis. In this competition, your task is to use labeled data to train a machine-learning algorithm to be able to predict on unseen test data. There are 13 predictor attributes as described in the Data page. All the best. 
This is where all the knowledge you have learned so far comes into application.`'",tabular data,Heart Disease Prediction,inClass,Predict whether a patient had a heart disease,categorizationaccuracy,heart-disease-prediction 135,"'`This competition is for the summer interns at LAS (and any other LAS members who want to participate). The goal is to predict the sentiment of tweets scraped from Twitter. Awards The public leader-board is calculated on 40% of the test data. The private leader-board is calculated on the other 60%, and will be revealed after the competition ends. Winners are determined by the private leaderboard. If you can, present your code in a notebook so that the other competitors can see it. If you create a cool visualization in your notebook, we may be able to show off your work to the lab.`'",text data,Tweet Sentiment Analysis,inClass,Sentiment analysis of tweets on a theme,auc,tweet-sentiment-analysis 136,"'`Predicting the popularity of Spotify songs In this competition you will have to predict the popularity of a song on Spotify by analyzing the data in the provided database. Based on what you learned at the GirlsGoIT summer camp, you will have to create a classification/regression model that predicts the popularity of each song, from 1 to 10. Rules: The competition is open only to participants of the GirlsGoIT 2020 summer camp. The code must be written in **Python**; other programming languages are not accepted!!! Only the models taught at the GirlsGoIT summer camp may be used! Do not copy the ideas of other teams; your own idea might be better! You will upload only the predictions file predictions.csv, and you will send the presentations separately to the GirlsGoIT trainers! 
Final evaluation The final evaluation will consist of the result obtained in this Kaggle competition and the evaluation of the PowerPoint presentation, in which you will include the project description, charts, etc.`'",tabular data,GirlsGoIT competition 2020,inClass,Competition for the GirlsGoIT Data Science Summer Camp 2020,categorizationaccuracy,girlsgoit-competition-2020 137,'`ITBA Deep Learning Lab Competition`',image data,Fashion MNIST-ITBA-LAB 2020,inClass,Classify the images into the 10 categories,categorizationaccuracy,fashion-mnist-itba-lab-2020 138,"'`Multiclass classification of ad texts from Avito. A baseline is provided.`'",text data,Text classification,inClass,Multiclass classification of ad texts,categorizationaccuracy,text-classification 139,'`On eclass`',tabular data,Find me that fish,inClass,Predict the Probability of Occurrence for the Marine Species Engraulis Encrasicolus,rmse,find-me-that-fish 140,"'`When you're driving, how important is it to be able to quickly tell the difference between a person vs. a stop sign? It's a hugely important, but typically very simple, distinction that you would make reflexively. Autonomous vehicles are not able to do this quite as effortlessly. This challenge, hosted by the 2018 CVPR workshop on autonomous driving (WAD), asks you to help give autonomously driven vehicles the same edge. Using an unprecedented dataset, you're asked to segment movable objects, such as cars and pedestrians, at instance level within image frames. By participating in this competition, you'll be helping to further our understanding of the current status of computer vision algorithms in solving environmental perception problems for autonomous driving. This challenge is a truly unique opportunity to work on a tremendously high value and high profile problem. The dataset presented here contains over 10 times more fine-labeled images than the largest public dataset of its type. 
Acknowledgements This competition is hosted by the 2018 CVPR workshop on autonomous driving (WAD), with dataset and evaluation metric contributed by Baidu Inc.`'",video,CVPR 2018 WAD Video Segmentation Challenge,,Can you segment each objects within image frames captured by vehicles?,CVPRAutoDrivingAveragePrecision,cvpr-2018-wad-video-segmentation-challenge 141,"'`DanburyAI: June 2018 Workshop Competition The Street View House Numbers (SVHN) Dataset SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images. 10 classes, 1 for each digit. Digit '1' has label 1, '9' has label 9 and '0' has label 10. MNIST-like 32-by-32 images centered around a single character (many of the images do contain some distractors at the sides). Here is a selection of what the individual training images look like: Source Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng Reading Digits in Natural Images with Unsupervised Feature Learning NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011. (PDF) Web source: http://ufldl.stanford.edu/housenumbers/`'",image data,DanburyAI: June 2018 Workshop,,Can you create a neural network to classify street view house number digits?,categorizationaccuracy,danburyai:-june-2018-workshop 142,"'`Goal It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. Metric Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. 
(Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.) Submission Format: see the submission file.`'",tabular data,DASPRO DATA HACKATHON,,House Pricing Problem. Let's predict the price!,rmse,daspro-data-hackathon 143,"'`Welcome to General Assembly DC's 31st Part Time Data Science Competition. Tonight, you'll be hacking away at Shuttle Radiator data to classify potential anomalies in the radiator position onboard Space Shuttles. NUMBER OF EXAMPLES training set 43500 test set 14500 NUMBER OF ATTRIBUTES 9 The shuttle dataset contains 9 attributes, all of which are numerical. The first is time. The last column is the class, which has been coded as follows: 1 Rad Flow 2 Fpv Close 3 Fpv Open 4 High 5 Bypass 6 Bpv Close 7 Bpv Open Approximately 80% of the data belongs to class 1. Therefore the default accuracy is about 80%. The aim here is to obtain an accuracy of 99 - 99.9%. Our evaluation metric is accuracy for this competition. Acknowledgements We thank NASA and UCI for providing this dataset.`'",,DAT31 Shuttle Diaster,inClass,Predicting NASA Space Shuttle Part Failure,categorizationaccuracy,dat31-shuttle-diaster 144,"'`This event caters to the needs of all those who find data interesting and amusing, excel at telling stories using data, and can show the world the power of data! Data is now a very crucial part of any business and research, and most of them depend heavily upon finding patterns in data. This event aims to make people aware of finding and predicting search results using data, and hence to help various businesses excel at the real-life problems given to them. Format: Participants will be given data in .csv format. Allowed platforms for coding are R (using RStudio), Python (using Spyder or Jupyter Notebook) and Excel. The competition will be hosted online through Kaggle. Participants are expected to do exploratory analysis in code. 
Tasks will be divided into Data Pre-processing, Data Training, and Data Prediction. Participants are allowed to submit a maximum of 20 entries. Judging will consider each participant's overall improvement in accuracy, along with final accuracy and the complexity of the model used. Plagiarism of any form, if found, will lead to direct disqualification of the team. The final submission will be a .csv file in which participants submit their predictions for the respective id(s), along with a detailed explanation of the approach they used. Fee: NIL Prize Money: 7500 total`'",,DATA ANALYTICS CHALLENGE PRODIGY'18,inClass,"Data Analytics Event by Prodigy, NIT Trichy",auc,data-analytics-challenge-prodigy18 145,"'`In the United States, lung cancer strikes 225,000 people every year, and accounts for $12 billion in health care costs. Early detection is critical to give patients the best chance at recovery and survival. One year ago, the office of the U.S. Vice President spearheaded a bold new initiative, the Cancer Moonshot, to make a decade's worth of progress in cancer prevention, diagnosis, and treatment in just 5 years. In 2017, the Data Science Bowl will be a critical milestone in support of the Cancer Moonshot by convening the data science and medical communities to develop lung cancer detection algorithms. Using a data set of thousands of high-resolution lung scans provided by the National Cancer Institute, participants will develop algorithms that accurately determine when lesions in the lungs are cancerous. This will dramatically reduce the false positive rate that plagues the current detection technology, get patients earlier access to life-saving interventions, and give radiologists more time to spend with their patients. This year, the Data Science Bowl will award $1 million in prizes to those who observe the right patterns, ask the right questions, and in turn, create unprecedented impact around cancer screening care and prevention. 
The funds for the prize purse will be provided by the Laura and John Arnold Foundation. Visit DataScienceBowl.com to: Sign up to receive news about the competition Learn about the history of the Data Science Bowl and past competitions Read our latest insights on emerging analytics techniques Acknowledgments The Data Science Bowl is presented by Competition Sponsors Laura and John Arnold Foundation Cancer Imaging Program of the National Cancer Institute American College of Radiology Amazon Web Services NVIDIA Data Support Providers National Lung Screening Trial The Cancer Imaging Archive Diagnostic Image Analysis Group, Radboud University Lahey Hospital & Medical Center Copenhagen University Hospital Supporting Organizations Bayes Impact Black Data Processing Associates Code the Change Data Community DC DataKind Galvanize Great Minds in STEM Hortonworks INFORMS Lesbians Who Tech NSBE Society of Asian Scientists & Engineers Society of Women Engineers University of Texas Austin, Business Analytics Program, McCombs School of Business US Dept. of Health and Human Services US Food and Drug Administration Women in Technology Women of Cyberjutsu`'",,Data Science Bowl 2017,,Can you improve lung cancer detection?,LogLoss,data-science-bowl-2017 146,"'`Spot Nuclei. Speed Cures. Imagine speeding up research for almost every disease, from lung cancer and heart disease to rare disorders. The 2018 Data Science Bowl offers our most ambitious mission yet: create an algorithm to automate nucleus detection. We've all seen people suffer from diseases like cancer, heart disease, chronic obstructive pulmonary disease, Alzheimer's, and diabetes. Many have seen their loved ones pass away. Think how many lives would be transformed if cures came faster. By automating nucleus detection, you could help unlock cures faster, from rare disorders to the common cold. Want a snapshot about the 2018 Data Science Bowl? View this video. Why nuclei? 
Identifying the cells' nuclei is the starting point for most analyses because most of the human body's 30 trillion cells contain a nucleus full of DNA, the genetic code that programs each cell. Identifying nuclei allows researchers to identify each individual cell in a sample, and by measuring how cells react to various treatments, the researcher can understand the underlying biological processes at work. By participating, teams will work to automate the process of identifying nuclei, which will allow for more efficient drug testing, shortening the 10 years it takes for each new drug to come to market. Check out this video overview to find out more. What will participants do? Teams will create a computer model that can identify a range of nuclei across varied conditions. By observing patterns, asking questions, and building a model, participants will have a chance to push state-of-the-art technology farther. Visit DataScienceBowl.com to: Sign up to receive news about the competition Learn about the history of the Data Science Bowl and past competitions Read our latest insights on emerging analytics techniques`'",,2018 Data Science Bowl,,Find the nuclei in divergent images to advance medical discovery,custom metric,2018-data-science-bowl 147,"'`The Competition The Data Science Challenge @ ITA is a student data science competition held annually at the Instituto Tecnológico de Aeronáutica (ITA) since 2019. In this edition, the challenge consists of building a machine learning model for predictive maintenance of aircraft. Competitors will have access to a fictitious confidential database based on real data. The data consist of sensor measurements and aircraft messages over time, along with the dates and times of preventive part replacements. See the rules here. 
Registration Registration will be permitted only for students currently enrolled, or enrolled in 2019, in an undergraduate or graduate program at the following institutions: Instituto Tecnológico de Aeronáutica (ITA) Universidade Federal de São Paulo (UNIFESP) Instituto Federal de São Paulo (IFSP) Register here. Support ITAEx InovaLab IEAv BCG Gamma Organization Prof. Filipe Verri (coordinator), Computer Science Division, ITA Prof. Elton Sbruzzi, Computer Science Division, ITA Prof. Vitor Curtis, Computer Science Division, ITA Prof. Ana Lorena, Computer Science Division, ITA`'",,Data Science Challenge at ITA 2020,,Data science competition for students of ITA/UNIFESP/IFSP.,wrmse,data-science-challenge-at-ita-2020 148,"'`Welcome In this competition you'll notice there isn't a leaderboard, and you are not required to develop a predictive model. This isn't a traditional supervised Kaggle machine learning competition. CareerVillage.org is a nonprofit that crowdsources career advice for underserved youth. Founded in 2011 in four classrooms in New York City, the platform has now served career advice from 25,000 volunteer professionals to over 3.5M online learners. The platform uses a Q&A style similar to StackOverflow or Quora to provide students with answers to any question about any career. In this Data Science for Good challenge, CareerVillage.org, in partnership with Google.org, is inviting you to help recommend questions to appropriate volunteers. To support this challenge, CareerVillage.org has supplied five years of data. Problem Statement The U.S. has almost 500 students for every guidance counselor. Underserved youth lack the network to find their career role models, making CareerVillage.org the only option for millions of young people in America and around the globe with nowhere else to turn. To date, 25,000 volunteers have created profiles and opted in to receive emails when a career question is a good fit for them. 
This is where your skills come in. To help students get the advice they need, the team at CareerVillage.org needs to be able to send the right questions to the right volunteers. The notifications sent to volunteers seem to have the greatest impact on how many questions are answered. Your objective: develop a method to recommend relevant questions to the professionals who are most likely to answer them. Criteria for Measuring Solutions Performance: How well does the solution match professionals to the questions they would be motivated to answer? CareerVillage.org will not be able to live-test every submission, so a strong entry will clearly articulate why it will be effective at motivating answers. Easy to implement: The CareerVillage.org team wants to put the winning submissions to work, quickly. A good entry will be well documented and easy to test in production. Extensibility: In the future, CareerVillage.org aims to add more data features and to accommodate new objectives. Winning submissions should allow for this and other augmentations to be added in the future.`'",,Data Science for Good: CareerVillage.org,analytics,Match career advice questions with professionals in the field,people,data-science-for-good:-careervillage.org 149,"'`Data Science for Good: City of Los Angeles Help the City of Los Angeles to structure and analyze its job descriptions The City of Los Angeles faces a big hiring challenge: 1/3 of its 50,000 workers are eligible to retire by July of 2020. The city has partnered with Kaggle to create a competition to improve the job bulletins that will fill all those open positions. Problem Statement The content, tone, and format of job bulletins can influence the quality of the applicant pool. Overly-specific job requirements may discourage diversity. The Los Angeles Mayor's Office wants to reimagine the city's job bulletins by using text analysis to identify needed improvements. 
The goal is to convert a folder full of plain-text job postings into a single structured CSV file and then to use this data to: (1) identify language that can negatively bias the pool of applicants; (2) improve the diversity and quality of the applicant pool; and/or (3) make it easier to determine which promotions are available to employees in each job class. How to Participate Accept the Rules Accept the competition rules. Make Your Submission Follow the submission instructions. WIth your help, Los Angeles will overcome a wave of retirements and fill those jobs with a strong and diverse workforce. Good luck and happy Kaggling! Do you think companies can find better candidates by improving their job postings? We hope to create an open-sourced body of work focused on this topic by hosting another Data Science for Good competition, this time in partnership with the City of Los Angeles.`'",,Data Science for Good: City of Los Angeles,,Help the City of Los Angeles to structure and analyze its job descriptions,employment,data-science-for-good:-city-of-los-angeles 150,"'`hosting a meetup on Scikit-learn We encourage participants to post code via the ""Tutorials"" link on the left. Don't worry about accuracy or whether your code is perfect. The aim here is to explore sklearn by using it. Its implementation is high quality due to s Meetup Information Thursday, March 7, 2013, Learning in Python with scikit-learn"" by Andreas Mueller ""Parallel and large scale learning with scikit-learn"" by Olivier Grisel notebook interface How to perform scalable text feature extraction with the Hashing Trick and hyper parameters tuning How to optimize memory usage with memory mapping How to approximate kernel Support Vector Machines for large scale datasets A short introduction to Ensembles with model averaging and Random Forests by day and a Python machine learning hacker by night. 
He is interested in applications to Natural Language Processing, Computer Vision and predictive modelling.`'",,Data Science London + Scikit-learn,,Scikit-learn is an open-source machine learning library for Python. Give it a try here!,CategorizationAccuracy,data-science-london-+-scikit-learn 151,"'`Description Emotions are expressed in nuanced ways, which vary by collective or individual experiences, knowledge, and beliefs. Therefore, to understand emotion, as conveyed through text, a robust mechanism capable of capturing and modeling different linguistic nuances and phenomena is needed. Thus, in this competition, we provide a dataset which was crawled from Twitter, and we have already labeled the emotion for these tweets by some specific hashtags in the original text. There are 8 classes (or say emotions) in our dataset: anger, anticipation, disgust, fear, sadness, surprise, trust, and joy. You have to clean the data by doing some pre-processing first. Then, apply feature engineering or any other data mining technique you have or haven't learned in the Data Mining course. The final goal is to learn a model that is able to predict the emotion behind each tweet. Note More detail about Assignment 2 is on iLMS (link) and GitHub (Lab2). Remember to fill in your team name Here . Acknowledgements We thank IDEA Lab for providing this dataset.`'",,Data Mining Lab2,inClass,Emotion Recognition on Twitter,meanfscore,data-mining-lab2 152,"'`DataQuest 2020 - Organised by PASC Gear up guys, you have a new problem statement to solve! :) The temperature and the moisture conditions in a factory outlet were calculated using a wireless sensor. The data that was observed was averaged over 10-minute time slots. The various temperatures calculated for the different rooms inside the factory are given. Two random variables were added in the data-set for testing the regression models and to filter out non-predictive parameters.
Predict the energy consumed.`'",,PASC Data-Quest 2020,inClass,Take Me Higher!,rmse,pasc-data-quest-2020 153,"'`Plankton are critically important to our ecosystem, accounting for more than half the primary productivity on earth and nearly half the total carbon fixed in the global carbon cycle. They form the foundation of aquatic food webs including those of large, important fisheries. Loss of plankton populations could result in ecological upheaval as well as negative societal impacts, particularly in indigenous cultures and the developing world. Plankton's global significance makes their population levels an ideal measure of the health of the world's oceans and ecosystems. Traditional methods for measuring and monitoring plankton populations are time-consuming and cannot scale to the granularity or scope necessary for large-scale studies. Improved approaches are needed. One such approach is through the use of an underwater imagery sensor. This towed, underwater camera system captures microscopic, high-resolution images over large study areas. The images can then be analyzed to assess species populations and distributions. Manual analysis of the imagery is infeasible; it would take a year or more to manually analyze the imagery volume captured in a single day. Automated image classification using machine learning tools is an alternative to the manual approach. Analytics will allow analysis at speeds and scales previously thought impossible. The automated system will have broad applications for assessment of ocean and ecosystem health. The National Data Science Bowl challenges you to build an algorithm to automate the image identification process. Scientists at the Hatfield Marine Science Center and beyond will use the algorithms you create to study marine food webs, fisheries, ocean conservation, and more. This is your chance to contribute to the health of the world's oceans, one plankton at a time.
Acknowledgements The National Data Science Bowl is presented with data provided by the Hatfield Marine Science Center at Oregon State University.`'",,National Data Science Bowl,featured,"Predict ocean health, one plankton at a time",MulticlassLossOld,national-data-science-bowl 154,"'`This is a parallel competition to DCASE2019 challenge task 1B, intended to serve as a public leaderboard for participants during the DCASE challenge. The competition is open for everybody and there is no requirement to submit to the DCASE challenge. This subtask is concerned with the situation in which an application will be tested with different devices, possibly not the same as the ones used to record the development data. In this case, the evaluation data contains more devices than the development data. The subtask uses the TAU Urban Acoustic Scenes 2019 Mobile dataset. The training material of this competition is the development dataset released for Task1B, and the testing material consists of similar material to the official evaluation dataset in the DCASE challenge. The amount of testing material in this competition is considerably lower than the official evaluation material in the DCASE challenge. There are three parallel competitions to DCASE challenge task 1 open on Kaggle, one for each subtask: Task1A Leaderboard, Acoustic Scene Classification, single recording device used in the data collection, no external data allowed. Task1B Leaderboard, Acoustic Scene Classification with mismatched recording devices, multiple recording devices used in the data collection, no external data allowed. Task1C Leaderboard, Open-set Acoustic Scene Classification, classification on data that includes classes not encountered in the training data. Note: Official DCASE submission is not done through Kaggle. See the DCASE website for full information on how to submit to the challenge.
Organizers Annamaria Mesaros, Assistant Professor, Tampere University, Finland Toni Heittola, Researcher, Tampere University, Finland Tuomas Virtanen, Professor, Tampere University, Finland Acoustic scene classification task The goal of acoustic scene classification is to classify a test recording into one of the provided predefined classes that characterizes the environment in which it was recorded, for example ""park"", ""pedestrian street"", ""metro station"". Audio dataset The dataset for this task is the TAU Urban Acoustic Scenes 2019 Mobile dataset, consisting of recordings from various acoustic scenes. This dataset extends the TUT Urban Acoustic Scenes 2018 Mobile dataset with 6 more cities to a total of 12 large European cities. For each scene class, recordings were done in different locations; for each recording location there are 5-6 minutes of audio. The original recordings were split into segments with a length of 10 seconds that are provided in individual files. Available information about the recordings includes the following: acoustic scene class, city, and recording location. Acoustic scenes (10): Airport - airport Indoor shopping mall - shopping_mall Metro station - metro_station Pedestrian street - street_pedestrian Public square - public_square Street with medium level of traffic - street_traffic Travelling by a tram - tram Travelling by a bus - bus Travelling by an underground metro - metro Urban park - park Data was recorded in the following cities: Amsterdam Barcelona Helsinki Lisbon London Lyon Madrid Milan Prague Paris Stockholm Vienna The dataset was collected by Tampere University of Technology between 05/2018 - 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND. TAU Urban Acoustic Scenes 2019 Mobile, Development dataset contains material recorded with three devices (A, B and C).
Each acoustic scene has 1440 segments (240 minutes of audio) recorded with device A (main device) and 108 segments of parallel audio (18 minutes) each recorded with devices B and C. The dataset contains in total 46 hours of audio. TAU Urban Acoustic Scenes 2019 Mobile, Development dataset (18.7 GB) TAU Urban Acoustic Scenes 2019 Mobile, Leaderboard dataset contains only material recorded with devices B and C, and it is used in this competition as the test dataset. This dataset is used only in this competition, not in the actual DCASE2019 Challenge. TAU Urban Acoustic Scenes 2019 Mobile, Leaderboard dataset (1.3 GB) Recording procedure Recordings were made using four devices that captured audio simultaneously. The main recording device consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder using a 48 kHz sampling rate and 24-bit resolution. The microphones are specifically made to look like headphones, being worn in the ears. As a result, the recorded audio is very similar to the sound that reaches the human auditory system of the person wearing the equipment. This equipment is further referred to as device A. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session. All simultaneous recordings are time synchronized. Baseline system The baseline system provides a simple entry-level approach that gives reasonable results in the subtasks of Task 1. The baseline system is built on the dcase_util toolbox. DCASE2019 Task 1 Baseline code repository System description The baseline system implements a convolutional neural network (CNN) based approach, where log mel-band energies are first extracted for each 10-second signal, and a network consisting of two CNN layers and one fully connected layer is trained to assign scene labels to the audio signals.
A more detailed description and system accuracy on the development dataset can be found in the readme file in the code repository.`'",,DCASE2019 Challenge - Task1B Leaderboard,,Acoustic Scene Classification with mismatched recording devices,categorizationaccuracy,dcase2019-challenge-task1b-leaderboard 155,"'`This competition is closed for submissions. Participants' selected code submissions were re-run by the host on a privately-held test set and the private leaderboard results have been finalized. Late submissions will not be opened, due to an inability to replicate the unique design of this competition. Deepfake techniques, which present realistic AI-generated videos of people doing and saying fictional things, have the potential to have a significant impact on how people determine the legitimacy of information presented online. These content generation and modification technologies may affect the quality of public discourse and the safeguarding of human rights, especially given that deepfakes may be used maliciously as a source of misinformation, manipulation, harassment, and persuasion. Identifying manipulated media is a technically demanding and rapidly evolving challenge that requires collaborations across the entire tech industry and beyond. AWS, Facebook, Microsoft, the Partnership on AI's Media Integrity Steering Committee, and academics have come together to build the Deepfake Detection Challenge (DFDC). The goal of the challenge is to spur researchers around the world to build innovative new technologies that can help detect deepfakes and manipulated media. Challenge participants must submit their code into a black-box environment for testing. Participants will have the option to make their submission open or closed when accepting the prize. Open proposals will be eligible for challenge prizes as long as they abide by the open source licensing terms. Closed proposals will be proprietary and not be eligible to accept the prizes.
Regardless of which track is chosen, all submissions will be evaluated in the same way. Results will be shown on the leaderboard. The PAI Steering Committee has emphasized the need to ensure that all technical efforts incorporate attention to how the resulting code and products based on it can be made as accessible and useful as possible to key frontline defenders of information quality such as journalists and civic leaders around the world. The DFDC results will be a contribution to this effort and to building a robust response to the emergent threat deepfakes pose globally.`'",,Deepfake Detection Challenge,,Identify videos with facial or voice manipulations,LogLoss,deepfake-detection-challenge 156,"'`mail.ru: main_category. mail.ru group: transfer learning; data augmentation; metric learning; pseudo labeling; multitask learning`'",,DeepNLP HSE Course,inClass,Millions of trash questions with billions of answers,meanfscore,deepnlp-hse-course 157,"'`This competition is provided as a way to explore different time series techniques on a relatively simple and clean dataset. You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores. What's the best way to deal with seasonality? Should stores be modeled separately, or can you pool them together? Does deep learning work better than ARIMA? Can either beat xgboost? This is a great competition to explore different models and improve your skills in forecasting.`'",,Store Item Demand Forecasting Challenge,,Predict 3 months of item sales at different stores ,SMAPE,store-item-demand-forecasting-challenge 158,"'`Optical Character Recognition (OCR) is the process of getting typed or handwritten documents into a digitized format.
If you've read a classic novel on a digital reading device or had your doctor pull up old healthcare records via the hospital computer system, you've probably benefited from OCR. OCR makes previously static content editable, searchable, and much easier to share. But a lot of documents eager for digitization are being held back. Coffee stains, faded sun spots, dog-eared pages, and lots of wrinkles are keeping some printed documents offline and in the past. This competition challenges you to give these documents a machine learning makeover. Given a dataset of images of scanned text that has seen better days, you're challenged to remove the noise. Improving the ease of document enhancement will help us get that rare mathematics book on our e-reader before the next beach vacation. We've kicked off the fun with a few handy scripts to get you started on the dataset. Acknowledgements Kaggle is hosting this competition for the machine learning community to use for fun and practice. This dataset was created by M.J. Castro-Bleda, S. España-Boquera, J. Pastor-Pellicer, F. Zamora-Martinez. We also thank the UCI machine learning repository for hosting the dataset. If you use the problem in publication, please cite: Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science`'",,Denoising Dirty Documents,,Remove noise from printed text,RMSE,denoising-dirty-documents 159,'`Make your best model based on a fully-connected (dense) neural network architecture to classify items in the anonymized dataset.`',,Dense neural network for classification,,Build a dense neural network model to classify the dataset,meanfscore,dense-neural-network-for-classification 160,"'`How Does the Prize Work?
The $10,000 prize pool will be split between the top two finishers for the classification challenge and the top visualization submission as follows: First place: $7,000 Second place: $2,500 Visualization Prospect: $500 But wait, there's more... Recruiting Competition Do you want a chance to build a crucial component of Web 2.0 infrastructure: the defense against spam and abuse? Do you live and breathe machine learning and data-mining? We're looking for someone whose passion lies in the invention and application of cutting-edge machine learning and data-mining techniques. Impermium is an engineering-driven startup dedicated to helping consumer web sites protect themselves from ""social spam,"" account hacking, bot attacks, and more. Impermium will review the top entries and offer interviews to the creators of those submissions which are exceptional. Due to visa issuance delays, Impermium is not able to sponsor new H1B applicants but is happy to support H1B transfers, permanent residents and US citizens.`'",,Detecting Insults in Social Commentary,recruitment,Predict whether a comment posted during a public discussion is considered insulting to one of the participants.,AUC,detecting-insults-in-social-commentary 161,"'`Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people. The US Centers for Disease Control and Prevention estimates that 29.1 million people in the US have diabetes and the World Health Organization estimates that 347 million people have the disease worldwide. Diabetic Retinopathy (DR) is an eye disease associated with long-standing diabetes. Around 40% to 45% of Americans with diabetes have some stage of the disease. Progression to vision impairment can be slowed or averted if DR is detected in time; however, this can be difficult as the disease often shows few symptoms until it is too late to provide effective treatment.
Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow-up, miscommunication, and delayed treatment. Clinicians can identify DR by the presence of lesions associated with the vascular abnormalities caused by the disease. While this approach is effective, its resource demands are high. The expertise and equipment required are often lacking in areas where the rate of diabetes in local populations is high and DR detection is most needed. As the number of individuals with diabetes continues to grow, the infrastructure needed to prevent blindness due to DR will become even more insufficient. The need for a comprehensive and automated method of DR screening has long been recognized, and previous efforts have made good progress using image classification, pattern recognition, and machine learning. With color fundus photography as input, the goal of this competition is to push an automated detection system to the limit of what is possible, ideally resulting in models with realistic clinical potential. The winning models will be open sourced to maximize the impact such a model can have on improving DR detection. Acknowledgements This competition is sponsored by the California Healthcare Foundation.
Retinal images were provided by EyePACS, a free platform for retinopathy screening.`'",,Diabetic Retinopathy Detection,featured,Identify signs of diabetic retinopathy in eye images,QuadraticWeightedKappa,diabetic-retinopathy-detection 162,"'`The goal of this competition is the prediction of the price of diamonds based on their characteristics (weight, color, quality of cut, etc.), putting into practice all the machine learning techniques you know.`'",,Diamonds | datamad0620,inClass,Predicting diamonds prices,rmse,diamonds-|-datamad0620 163,"'`Hi Data Analyst/Scientist, this competition was created as practice for those of you who want to try out and sharpen your analysis and modeling skills. The objective of this competition is very simple: build an analysis and a machine learning model to estimate the price of diamonds. The challenge It is common knowledge that diamonds are valuable objects whose prices are fairly, or even very, expensive. Every diamond has certain characteristics, from its size to its level of fineness and so on. The data used is the diamonds dataset from the R package ggplot2. Acknowledgements Thanks to Hadley Wickham and team, who built an extraordinary visualization package and provided this data.`'",,Diamonds Price,,Predict the price of diamonds,rmse,diamonds-price 164,"'`This is the home page of the competition. In this competition you must analyze bank data from current accounts and identify insolvent individuals. You have 60,000 training units available to estimate the model, which you must then use to estimate the probability of default for 16,020 units, for which you do not have the target.`'",,Fraud detection,inClass,Classify bank transactions,auc,fraud-detection 165,"'`Start here if... You have some experience with R or Python and machine learning basics, but you're new to computer vision. This competition is the perfect introduction to techniques like neural networks using a classic dataset including pre-extracted features.
Competition Description MNIST (""Modified National Institute of Standards and Technology"") is the de facto hello world dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike. In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We've curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare. Practice Skills Computer vision fundamentals including simple neural networks Classification methods such as SVM and K-nearest neighbors Acknowledgements More details about the dataset, including algorithms that have been tried on it and their levels of success, can be found at http://yann.lecun.com/exdb/mnist/index.html. The dataset is made available under a Creative Commons Attribution-Share Alike 3.0 license.`'",,Digit Recognizer,,Learn computer vision fundamentals with the famous MNIST data,categorizationaccuracy,digit-recognizer 166,"'`This is the Phase 3 challenge from the Hackathon of the 3rd Deep Learning and AI Summer/Winter School (DLAI3). More details about the previous challenge can be found at https://www.kaggle.com/c/dlai3/. Medical image analysis has continually been an area of prominent and growing importance, in both the research and application aspects. However, it is important to realize that the algorithms, methodology, as well as the source of data, need to be strictly scrutinized.
Phase 3 is a challenge that provides a mixed dataset of publicly available COVID-19 chest x-ray images from various sources, a pneumonia dataset obtained from a study on children, partially verified thorax CXR images of adults from NIH, and unverified no-finding images from NIH. See below for the sources of these data. Goal The goal is to provide the participants a chance to test their skills to come up with a well-performing model in terms of the macro F1 score to classify COVID-19 chest x-ray images. The Phase 3 DLAI3 Hackathon will award the top three entries that have a full paper submission to the 11th International Conference on Computational Systems-Biology and Bioinformatics (CSBio2020), under the theme of CHAT-2020 (COVID-19 Health, Analytics, and Technologies). The top three individuals/teams will be provided with complimentary registration for their full paper submission to the ACM ICPS Proceedings of CSBio2020 and will be invited to submit an extended work to a post-conference journal special issue. Dataset citation Please cite the Kaggle dataset as follows: Jonathan H. Chan, DLAI3 Hackathon Phase3 COVID-19 CXR Challenge. Kaggle, doi: 10.34740/KAGGLE/DSV/1347344.
Acknowledgments The following are the sources and related publications for this dataset: 1) Pneumonia CXR images of children retrieved from Mendeley at: https://data.mendeley.com/datasets/rscbjbr9sj/2 Related publication: https://www.cell.com/cell/fulltext/S0092-8674(18)30154-5 2) Cropped COVID-19 + selected images from the following (retrieved July 17, 2020): https://github.com/agchung/Actualmed-COVID-chestxray-dataset Related publication: https://arxiv.org/abs/2003.09871 v4 Mon, 11 May 2020 3) COVID-19 chest x-ray images from: https://github.com/ieee8023/covid-chestxray-dataset (Retrieved July 13, 2020) Images: https://github.com/ieee8023/covid-chestxray-dataset/blob/master/images Metadata: https://github.com/ieee8023/covid-chestxray-dataset/blob/master/metadata.csv 4) Selected images from: https://nihcc.app.box.com/v/ChestXray-NIHCC Publication: ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases https://arxiv.org/abs/1705.02315 [v5] Thu, 14 Dec 2017 Semi-verified by: https://lukeoakdenrayner.wordpress.com/2017/12/18/the-chestxray14-dataset-problems/comment-page-1/`'",,DLAI3 Hackathon Phase 3,inClass,Multi-class COVID-19 Chest x-ray challenge,macrofscore,dlai3-hackathon-phase-3 167,"'`Who's a good dog? Who likes ear scratches? Well, it seems those fancy deep neural networks don't have all the answers. However, maybe they can answer that ubiquitous question we all ask when meeting a four-legged stranger: what kind of good pup is that? In this playground competition, you are provided a strictly canine subset of ImageNet in order to practice fine-grained image categorization. How well can you tell your Norfolk Terriers from your Norwich Terriers? With 120 breeds of dogs and a limited number of training images per class, you might find the problem more, err, ruff than you anticipated.
Acknowledgments We extend our gratitude to the creators of the Stanford Dogs Dataset for making this competition possible: Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li.`'",,Dog Breed Identification,,Determine the breed of a dog in an image,MulticlassLoss,dog-breed-identification 168,"'`In this competition, you'll write an algorithm to classify whether images contain either a dog or a cat. This is easy for humans, dogs, and cats. Your computer will find it a bit more difficult. Deep Blue beat Kasparov at chess in 1997. Watson beat the brightest trivia minds at Jeopardy in 2011. Can you tell Fido from Mittens in 2013? The Asirra data set Web services are often protected with a challenge that's supposed to be easy for people to solve, but difficult for computers. Such a challenge is often called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) or HIP (Human Interactive Proof). HIPs are used for many purposes, such as to reduce email and blog spam and prevent brute-force attacks on web site passwords. Asirra (Animal Species Image Recognition for Restricting Access) is a HIP that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but studies have shown that people can accomplish it quickly and accurately. Many even think it's fun! Here is an example of the Asirra interface: Asirra is unique because of its partnership with Petfinder.com, the world's largest site devoted to finding homes for homeless pets. They've provided Microsoft Research with over three million images of cats and dogs, manually classified by people at thousands of animal shelters across the United States. Kaggle is fortunate to offer a subset of this data for fun and research. Image recognition attacks While random guessing is the easiest form of attack, various forms of image recognition can allow an attacker to make guesses that are better than random. 
There is enormous diversity in the photo database (a wide variety of backgrounds, angles, poses, lighting, etc.), making accurate automatic classification difficult. In an informal poll conducted many years ago, computer vision experts posited that a classifier with better than 60% accuracy would be difficult without a major advance in the state of the art. For reference, a 60% classifier improves the guessing probability of a 12-image HIP from 1/4096 to 1/459. State of the art The current literature suggests machine classifiers can score above 80% accuracy on this task [1]. Therefore, Asirra is no longer considered safe from attack. We have created this contest to benchmark the latest computer vision and deep learning approaches to this problem. Can you crack the CAPTCHA? Can you improve the state of the art? Can you create lasting peace between cats and dogs? Okay, we'll settle for the former. Acknowledgements We extend our thanks to Microsoft Research for providing the data for this competition.`'",,Dogs vs. Cats,playground,Create an algorithm to distinguish dogs from cats,CategorizationAccuracy,dogs-vs.-cats 169,"'`Founded in 2000 by a high school teacher in the Bronx, DonorsChoose.org empowers public school teachers from across the country to request much-needed materials and experiences for their students. At any given time, there are thousands of classroom requests that can be brought to life with a gift of any amount. DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website. Next year, DonorsChoose.org expects to receive close to 500,000 project proposals.
As a result, there are three main problems they need to solve: How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible How to increase the consistency of project vetting across different volunteers to improve the experience for teachers How to focus volunteer time on the applications that need the most assistance The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval. With an algorithm to pre-screen applications, DonorsChoose.org can auto-approve some applications quickly so that volunteers can spend their time on more nuanced and detailed project vetting processes, including doing more to help teachers develop projects that qualify for specific funding opportunities. Your machine learning algorithm can help more teachers get funded more quickly, and with less cost to DonorsChoose.org, allowing them to channel even more funding directly to classrooms across the country. Getting Started with Kernels Get familiar with the competition data and the machine learning objective quickly using Kernels. Google's engineering education team has put together a starter tutorial implementing a benchmark linear classification model. Acknowledgments Machine Learning Crash Course was created by Google's engineering education team in partnership with numerous Machine Learning subject matter experts across Google.`'",,DonorsChoose.org Application Screening,,Predict whether teachers' project proposals are accepted,AUC,donorschoose.org-application-screening 170,"'`Long ago, in the distant, fragrant mists of time, there was a competition. It was not just any competition.
It was a competition that challenged mere mortals to model a 20,000x200 matrix of continuous variables using only 250 training samples without overfitting. Data scientists including Kaggle's very own Will Cukierski competed by the hundreds. Legends were made. (Will took 5th place, and eventually ended up working at Kaggle!) People overfit like crazy. It was a Kaggle-y, data science-y madhouse. So we're doing it again. Don't Overfit II: The Overfittening This is the next logical step in the evolution of weird competitions. Once again we have 20,000 rows of continuous variables, and a mere handful of training samples. Once again, we challenge you not to overfit. Do your best, model without overfitting, and add, perhaps, to your own legend. In addition to bragging rights, the winner also gets swag. Enjoy! Acknowledgments We hereby salute the hard work that went into the original competition, created by Phil Brierly. Thank you!`'",,Don't Overfit! II,,A Fistful of Samples,AUC,dont-overfit!-ii 171,"'`Imagine a world where we can use satellite images to help find better access to clean water, prevent poaching of wildlife, predict storms more efficiently, optimize traffic patterns more readily, and inform human behaviors to mitigate the spread of disease. Thanks to a marked increase of satellites in orbit, we will be able to capture images and the information contained within of nearly every place on Earth, every day by 2017. However, our ability to analyze datasets of these images has not advanced as quickly. Changes from day to day in images of the same location are subtle, can be hard to detect, and are difficult to understand in terms of their significance. In this competition, Draper provides a unique dataset of images taken at the same locations over 5 days. Kagglers are challenged to predict the chronological order of the photos taken at each location. 
Accurately doing so could uncover approaches that have a global impact on commerce, science, and humanitarian works.`'",,Draper Satellite Image Chronology,,Can you put order to space and time? ,MASpearmanR,draper-satellite-image-chronology 172,'`Accuracy`',,Dromosys Movie Review,,Sentiment classification on movie reviews,categorizationaccuracy,dromosys-movie-review 173,"'`PROJECT OVERVIEW Develop a methodology to calculate an average historical emissions factor of electricity generated for a sub-national region, using remote sensing data and techniques. The Environmental Insights Explorer team at Google is keen to gather insights on ways to improve calculations of global emissions factors for sub-national regions. The ultimate goal of this challenge is to test if calculations of emissions factors using remote sensing techniques are possible and on par with calculations of emissions factors from current methodologies. PROBLEM STATEMENT Current emissions factors methodologies are based on time-consuming data collection and may include errors derived from a lack of access to granular datasets, inability to refresh data on a frequent basis, overly general modeling assumptions, and inaccurate reporting of emissions sources like fuel consumption. This begs the question: What if there was a different way to calculate or measure emissions factors? We're challenging the Kaggle community to see if it's possible to use remote sensing techniques to better model emissions factors. You will develop a methodology to calculate an average historical emissions factor for electricity generation in a sub-national region. We've provided an initial list of datasets covering the geographic boundary of Puerto Rico to serve as the foundation for this analysis. As an island, Puerto Rico has fewer confounding factors from nearby areas.
Puerto Rico also offers a unique fuel mix and distinctive energy system layout that should make it easier to isolate pollution attributable to power generation in the remote sensing data. Participants will be tasked with developing a methodology to calculate an average annual historical emissions factor for the sub-national region. Participants will also be asked to provide an explanation of the conditions that would result in a higher/lower emissions factor, as well as a recommendation for how the methodology could be applied to calculate the emissions factor of electricity for another geospatial area using similar techniques. Bonus points will be awarded for smaller time slices of the average historical emissions factors, such as one per month for the 12-month period, and additional bonus points will be awarded for participants who develop methodologies for calculating marginal emissions factors for the sub-national region. HOW TO PARTICIPATE To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will only review the most recent entry. To be valid, a submission must be contained in one or more notebooks, and made public on or before the submission deadline. Participants are free to use any datasets in addition to the official Kaggle dataset, but those datasets must also be publicly available on either Earth Engine or Kaggle for the submission to be valid.`'",,DS4G - Environmental Insights Explorer,,Exploring alternatives for emissions factor calculations,pollution,ds4g-environmental-insights-explorer 174,"'`CityMov is a well-known firm in the automotive industry with a continuous drive to address traffic congestion. This drive led them to manufacture a mini scooter, and then to wonder whether their customers are likely to buy this new solution despite having other means of transportation.
The first release of the scooter, called NeverLate, was a great success, and the company now looks forward to releasing an improved version. You have been appointed as the Lead Data Scientist to build a predictive model to determine if a customer will buy this product or not. The model will be based on customer characteristics. The target variable, BuyScooter, is: 1 if the customer buys the scooter, 0 if the customer didn't.`'",,Taking the first Leap,inClass,"Scooter, the traffic solution",categorizationaccuracy,taking-the-first-leap 175,"'`The proliferation of satellite imagery has given us a radically improved understanding of our planet. It has enabled us to better achieve everything from mobilizing resources during disasters to monitoring effects of global warming. What is often taken for granted is that advancements such as these have relied on labeling features of significance like building footprints and roadways fully by hand or through imperfect semi-automated methods. As these large, complex datasets continue to increase exponentially in number, the Defence Science and Technology Laboratory (Dstl) is seeking novel solutions to alleviate the burden on their image analysts. In this competition, Kagglers are challenged to accurately classify features in overhead imagery. Automating feature labeling will not only help Dstl make smart decisions more quickly around the defense and security of the UK, but also bring innovation to computer vision methodologies applied to satellite imagery.`'",,Dstl Satellite Imagery Feature Detection,,Can you train an eye in the sky?,JaccardDSTLParallel,dstl-satellite-imagery-feature-detection 176,"'`Looking to learn about Machine Learning? Join in for a hands-on experience with EEE Datathon Challenge in collaboration with MLDA@EEE! Code your own data model using skills picked up from MLDA's very own tutorial videos! Sign up as an individual or a team of not more than 3 via https://tinyurl.com/NTUEEEDatathon!
Date: 9 May - 23 May 2020, 2359 Submission Deadline: 05/23/2020 11:59 PM SGT How to participate 1. Follow the NTU EEE Instagram page @ntueee! 2. Complete watching the series of the 5 introductory videos on MLDA@EEE youtube channel to get you started on your very own machine learning project 3. Upgrade your skills with 3 bonus videos on MLDA@EEE's youtube channel covering more advanced topics like CV, NLP and CNN! Watch them ALL in a playlist: https://www.youtube.com/watch?v=FgTpL-8gouE&list=PLXpIV63PRtvV0i00XYqfj2_mOjp8kt4ti On 15th May: datathon challenge dataset instructional video will be released! scoring board on Kaggle will also be updated daily Stay tuned for updates on @ntueee on Instagram! Start your coding: 1. Find ""Example notebook"" created by MLDA@EEE under Notebooks. 2. Click ""Copy and Edit"". 3. Rename and work on your notebook! 4. Name and submit your submission file! Instruction video for a quick start: https://www.youtube.com/watch?v=7w7n0DGsOaM You may also work with Jupyter lab on a local computer. Submit your file onto Kaggle from 15 May to 23 May, 2359 to be evaluated by our testing data! Protip: You are allowed to submit your csv file onto Kaggle as many times as you'd like to get the highest accuracy results! The top 3 teams with the highest accuracy scores will be crowned winners of EEE Datathon on 24 May! For more resources & tips, head over to our Induction Fiesta telegram channel: https://t.me/inductionfiesta2020 Organised by MLDA@EEE and Induction Fiesta EEE`'",,EEE Datathon Challenge 2020,,A datathon challenge organised by NTU EEE to hone your skill in data science and machine learning,r2score,eee-datathon-challenge-2020 177,"'`Imagine being hungry in an unfamiliar part of town and getting restaurant recommendations served up, based on your personal preferences, at just the right moment.
The recommendation comes with an attached discount from your credit card provider for a local place around the corner! Right now, Elo, one of the largest payment brands in Brazil, has built partnerships with merchants in order to offer promotions or discounts to cardholders. But do these promotions work for either the consumer or the merchant? Do customers enjoy their experience? Do merchants see repeat business? Personalization is key. Elo has built machine learning models to understand the most important aspects and preferences in their customers' lifecycle, from food to shopping. But so far none of them is specifically tailored for an individual or profile. This is where you come in. In this competition, Kagglers will develop algorithms to identify and serve the most relevant opportunities to individuals, by uncovering signal in customer loyalty. Your input will improve customers' lives and help Elo reduce unwanted campaigns, to create the right experience for customers.`'",,Elo Merchant Category Recommendation,,Help understand customer loyalty,RMSE,elo-merchant-category-recommendation 178,"'`Classify the emotion associated with each facial expression and prepare a submission. The submission is a .csv file indexed by id, generated from a Jupyter notebook. Images are 48x48 pixels; the training set contains 28,709 images and the test set 7,178. A baseline is provided, along with a strong-baseline to beat.`'",,Facial Emotion Recognition,inClass,Classify the emotion associated with the facial expression using CNN,meanfscore,facial-emotion-recognition 179,"'`We (the competition hosts) are excited to sponsor the Event Recommendation Engine Challenge, which asks you to predict what events our users will be interested in based on events they've responded to in the past, user demographic information, and what events they've seen and clicked on in our app. The insights you discover from this data, and the algorithms the winners create, will allow us to improve our event recommendation algorithm, a core part of our applications and a key element in improving user experience.
This is the first competition launching under the Kaggle Startup Program!`'",,Event Recommendation Engine Challenge,featured,"Predict what events our users will be interested in based on user actions, event metadata, and demographic information.",MAP@k,event-recommendation-engine-challenge 180,"'`Planning your dream vacation, or even a weekend escape, can be an overwhelming affair. With hundreds, even thousands, of hotels to choose from at every destination, it's difficult to know which will suit your personal preferences. Should you go with an old standby with those pillow mints you like, or risk a new hotel with a trendy pool bar? Expedia wants to take the proverbial rabbit hole out of hotel search by providing personalized hotel recommendations to their users. This is no small task for a site with hundreds of millions of visitors every month! Currently, Expedia uses search parameters to adjust their hotel recommendations, but there aren't enough customer-specific data to personalize them for each user. In this competition, Expedia is challenging Kagglers to contextualize customer data and predict the likelihood a user will stay at 100 different hotel groups. The data in this competition is a random selection from Expedia and is not representative of the overall statistics. `'",,Expedia Hotel Recommendations,,Which hotel type will an Expedia customer book?,MAP@{K},expedia-hotel-recommendations 181,"'`Expedia is the world's largest online travel agency (OTA) and powers search results for millions of travel shoppers every day. In this competitive market, matching users to hotel inventory is very important since users easily jump from website to website. As such, having the best ranking of hotels (sort) for specific users with the best integration of price competitiveness gives an OTA the best chance of winning the sale. For this contest, Expedia has provided a dataset that includes shopping and purchase data as well as information on price competitiveness.
The data are organized around a set of search result impressions, or the ordered list of hotels that the user sees after they search for a hotel on the Expedia website. In addition to impressions from the existing algorithm, the data contain impressions where the hotels were randomly sorted, to avoid the position bias of the existing algorithm. The user response is provided as a click on a hotel and/or a purchase of a hotel room. Appended to impressions are the following: 1) Hotel characteristics 2) Location attractiveness of hotels 3) Users' aggregate purchase history 4) Competitive OTA information Models will be scored via performance on a hold-out set.`'",,Personalize Expedia Hotel Searches - ICDM 2013,featured,Learning to rank hotels to maximize purchases,NDCG@{K},personalize-expedia-hotel-searches-icdm-2013 182,"'`The goal of this competition is to develop a machine learning model for explicit content detection on a web page. In particular, it is proposed to develop a model for an anti-porn filter. Each web page is described by its URL and title. The goal is to determine its content based on its description. Your final score for this competition will be estimated as follows: 1. Score = (""your quality"" - ""baseline method quality"") / (""max achieved quality"" - ""baseline method quality"") 2. Notebook with your best solution must be reproducible. Otherwise, you will not get any score.`'",,Explicit content detection,inClass,HSE Data analysis (Software Engineering) 2020,f_{beta},explicit-content-detection 183,"'`Looking for a data science position at Facebook? After two successful prior Kaggle competitions, Facebook continues their mission to identify the best data scientists and software engineers that Kaggle has to offer. In this third installment, they seek candidates who have experience text mining large amounts of data. This competition tests your text skills on a large dataset from the Stack Exchange sites. The task is to predict the tags (a.k.a.
keywords, topics, summaries), given only the question text and its title. The dataset contains content from disparate stack exchange sites, with a mix of both technical and non-technical questions. Positions are available in Menlo Park, Seattle, New York City, and London; candidates must have, or be eligible to obtain, authorization to work in the US or UK. Please note: you must compete as an individual in recruiting competitions. You may only use the data provided to make your predictions. Crawling stack exchange sites to look up answers is not permitted. Facebook will review the code of the top participants before deciding whether to offer an interview. This competition counts towards rankings & achievements. If you wish to be considered for an interview at Facebook, check the box ""Allow host to contact me"" when you make your first entry. Acknowledgements We thank Stack Exchange (and its users) for generously releasing the source dataset through its Creative Commons Data Dumps. All data is licensed under the cc-by-sa license.`'",,Facebook Recruiting III - Keyword Extraction,recruitment,Identify keywords and tags from millions of text questions,MeanFScore,facebook-recruiting-iii-keyword-extraction 184,"'`Ever wonder what it's like to work at Facebook? Facebook and Kaggle are launching a machine learning engineering competition for 2016. Trail blaze your way to the top of the leaderboard to earn an opportunity to interview for one of the 10+ open roles as a software engineer, working on world-class machine learning problems. The goal of this competition is to predict which place a person would like to check in to. For the purposes of this competition, Facebook created an artificial world consisting of more than 100,000 places located in a 10 km by 10 km square. For a given set of coordinates, your task is to return a ranked list of the most likely places.
Data was fabricated to resemble location signals coming from mobile devices, giving you a flavor of what it takes to work with real data complicated by inaccurate and noisy values. Inconsistent and erroneous location data can disrupt the experience for services like Facebook Check In. We highly encourage competitors to be active on Kaggle Scripts. Your work there will be thoughtfully included in the decision making process. Please note: You must compete as an individual in recruiting competitions. You may only use the data provided to make your predictions.`'",,Facebook V: Predicting Check Ins,,Identify the correct place for check ins,MAP@{K},facebook-v:-predicting-check-ins 185,'`Develop a machine learning program to identify when an article might be fake news. Run by the UTK Machine Learning Club.`',,Fake News,inClass,Build a system to identify unreliable news articles,categorizationaccuracy,fake-news 186,"'`We are hosting a fake news detection shared task for the Second International TrueFact Workshop: Making a Credible Web for Tomorrow in conjunction with SIGKDD 2020. In this shared task, the participants will design a system to distinguish fake claims from authentic ones. Acknowledgements We thank Kai Shu, Arizona State University, for providing this challenging dataset and co-organizing this task.`'",,Fake News Detection Challenge KDD 2020,inClass,Develop a machine learning algorithm to detect fake news,categorizationaccuracy,fake-news-detection-challenge-kdd-2020 187,"'`Problem description A journal needs to catalog all its news into different categories. The objective of this competition is to develop the best deep learning model to predict the category of new news articles.
The possible categories are: ambiente, equilibrioesaude, sobretudo, educacao, ciencia, tec, turismo, empreendedorsocial, comida`'",,FASAM - NLP Competition - Turma 4,inClass,Predict News Category,meanfscore,fasam-nlp-competition-turma-4 188,"'`The Model Domain is hosting a Model Competition/Training using Kaggle. You are asked to build a neural network model to predict fraudulent credit card transactions. The winner will be selected based on the model's performance on an out-of-sample test dataset. Objective: The goal of this competition is to gain additional practice building neural networks. Scoring Metric: Given the class imbalance ratio, submissions will be evaluated using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification. Deadline: Entries can be submitted at any time, and you can track your progress on the public leaderboard. We encourage you to make multiple submissions to try to improve your score. Final scoring using the private leaderboard will be conducted on June 30, 2020.`'",,FI 2020 Q2 Kaggle Competition!,,Use Neural Networks to identify fraudulent credit card transactions,meanfscore,fi-2020-q2-kaggle-competition! 189,'`Sales forecasting for baby monitors`',,FIAP FSBDS 21 - Baby Monitor Forecast,,This competition is running for FIAP Big Data Science course's students. Group 21,rmse,fiap-fsbds-21-baby-monitor-forecast 190,"'`Evaluation: The evaluation metric for this competition is based on R-square of your predicted Wage vs the real value of Wage on every submission you made compared to the Professor or TA's models. You must check the validation of your model to get your real rank. Invalid models are not accepted in the final grading scheme. Submission Format For every player in the dataset, submission files should contain two columns: Ob and WageNew. Submission files should be a csv file format.
The file should contain a header and have the following format: Ob, WageNew 1, 1230 8, 5420 9, 2015 10, 3165 etc.`'",,FIFA 2019 PLAYERS' WAGES,inClass,"18k+ Latest FIFA Players, with ~80 Attributes Extracted from FIFA database",r2score,fifa-2019-players-wages 191,'`week or two`',,fight bias part 1,,lets fight bias!,auc,fight-bias-part-1 192,"'`Majumdar, Khanna and Rawat have been invited to the 20th Annual Global Entrepreneurship Summit in Los Angeles, because of their exceptional skills in the field of Machine Learning. The investors in the summit are curious to know about the various incubating startups that are participating in the summit, and are also interested in funding them, if they are impressed by their work. The organisation team of the event has the data of all the companies that participated in the previous editions and also some outsourced data about other startups. They are willing to share the data with our three friends, provided they help investors in deciding whether they should invest in a particular startup or not. Can you help our three friends in their task?`'",,Find the fund,inClass,Help developers to find whether a company should be funded or not.,categorizationaccuracy,find-the-fund 193,"'`Prerequisites Create a new Kaggle account (if you don't already have one). Go to the Team tab and set your Team Name as your UMD directory ID (all lowercase, not your number ID, not your full name), e.g. hh2. If you've set this up correctly, your submission result will be publicly shown as your UMD directory ID. Goal It is your job to predict if a customer will make a payment for the billed amount next month or not. For each ID in the test set, you must predict a 1 or 0 value for the PAIDNEXTMONTH variable. What's the difference between a private and public leaderboard? The Kaggle leaderboard has a public and private component to prevent participants from overfitting to the leaderboard.
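The Ob/WageNew submission layout shown in the FIFA wages entry above can be produced with Python's standard csv module. This is an illustrative sketch only; the Ob values and wages are the made-up examples from the format description, not real predictions.

```python
import csv

# Illustrative rows in the Ob,WageNew format described above.
rows = [(1, 1230), (8, 5420), (9, 2015), (10, 3165)]

with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Ob', 'WageNew'])  # header row required by the format
    writer.writerows(rows)
```

In a real submission the rows would come from your model's predicted wages, one per Ob in the test set.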
If your model is overfit to a dataset then it is not generalizable outside of the dataset you trained it on. This means that your model would have low accuracy on another sample of data taken from a similar dataset. Public Leaderboard For all participants, the same 50% of predictions from the test set are assigned to the public leaderboard. The score you see on the public leaderboard reflects your model's accuracy on this portion of the test set. If you overfit the test set, you can do very well on the public leaderboard and very badly on the private leaderboard. Private Leaderboard The other 50% of predictions from the test set are assigned to the private leaderboard. The private leaderboard is not visible to participants until the competition has concluded. At the end of a competition, we will reveal the private leaderboard so you can see your score on the other 50% of the test data. The scores on the private leaderboard are used to determine the competition winners. Getting Started You may explore and run your machine learning code with Kernels, a cloud computing environment that enables reproducible and collaborative analysis. More info: Kernels Documentation Submission Submit in the same format as the sample submission; you can download an example submission file (sample-submission.csv) on the Data page. Make sure the IDs match your predictions. Your submission will show an error if you have extra columns (beyond ID and PAIDNEXTMONTH) or rows. Set your Team Name = your UMD Directory ID (not the numbers, not your full name). Submission Deadline: Mar 27 2019, 2:00pm EST`'",,UMD FIRE171 ASN4 Data Classification Challenge,inClass,Predicting if a customer will make a payment for the billed amount next month.,meanfscore,umd-fire171-asn4-data-classification-challenge 194,"'`Tensor Processing Units (TPUs) are Now Available on Kaggle Tensor Processing Unit (TPU) quotas are now available on Kaggle, at no cost to you!
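The public/private leaderboard mechanics described in the previous entry can be sketched in plain Python. This is a toy model: the hash-based 50/50 assignment and the accuracy scoring are assumptions for illustration, not Kaggle's actual implementation.

```python
import hashlib

def leaderboard_scores(predictions, truth):
    # Toy public/private leaderboard: each test id is assigned to one half
    # by a stable hash of the id, and accuracy is computed per half.
    halves = {'public': [], 'private': []}
    for test_id, pred in predictions.items():
        digest = int(hashlib.md5(str(test_id).encode()).hexdigest(), 16)
        half = 'public' if digest % 2 == 0 else 'private'
        halves[half].append(pred == truth[test_id])
    return {name: sum(hits) / len(hits) for name, hits in halves.items() if hits}

truth = {i: i % 2 for i in range(100)}
scores = leaderboard_scores({i: i % 2 for i in range(100)}, truth)
# A model matching the truth scores 1.0 on both halves; an overfit model
# can look good on the public half yet score poorly on the private half.
```

Because the private half is hidden until the end, tuning against the public score alone is exactly the overfitting trap the entry warns about.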
TPUs are powerful hardware accelerators specialized in deep learning tasks. They were developed (and first used) by Google to process large image databases, such as extracting all the text from Street View. This competition is designed for you to give TPUs a try. The latest Tensorflow release (TF 2.1) was focused on TPUs, and they're now supported both through the Keras high-level API and at a lower level, in models using a custom training loop. We can't wait to see how your solutions are accelerated by TPUs! The Challenge It's difficult to fathom just how vast and diverse our natural world is. There are over 5,000 species of mammals, 10,000 species of birds, 30,000 species of fish and, astonishingly, over 400,000 different types of flowers. In this competition, you're challenged to build a machine learning model that identifies the type of flowers in a dataset of images (for simplicity, we're sticking to just over 100 types). To get started with TPUs: Read the TPU documentation one-pager, then jump right into the Getting Started Notebook for this competition. Quick note: a TPU is a network-connected accelerator and requires a couple extra lines in your code. Flipping the TPU switch in your notebook will not, by itself, accelerate your code. Have Questions? Martin Görner, Google Developer Advocate and author of Tensorflow without a PhD, will be actively engaged in the competition forum. If you have a question or need help troubleshooting, that's the best place to find help.`'",,Flower Classification with TPUs,,"Use TPUs to classify 104 types of flowers ",MacroFScore,flower-classification-with-tpus 195,"'`Random forests? Cover trees? Not so fast, computer nerds. We're talking about the real thing. In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from cartographic variables. The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data.
Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type. This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes than of forest management practices. This competition originally ran in 2015. We are relaunching it as a kernels-only version here. Acknowledgements Kaggle is hosting this competition for the machine learning community to use for fun and practice. This dataset was provided by Jock A. Blackard and Colorado State University. We also thank the UCI machine learning repository for hosting the dataset. If you use the problem in publication, please cite: Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science`'",,Forest Cover Type (Kernels Only),,Use cartographic variables to classify forest categories,CategorizationAccuracy,forest-cover-type-(kernels-only) 196,"'`Some sounds are distinct and instantly recognizable, like a baby's laugh or the strum of a guitar. Other sounds aren't clear and are difficult to pinpoint. If you close your eyes, can you tell which of the sounds below is a chainsaw versus a blender? Moreover, we often experience a mix of sounds that create an ambience like the clamoring of construction, a hum of traffic from outside the door, blended with loud laughter from the room, and the ticking of the clock on your wall. The sound clip below is of a busy food court in the UK. Partly because of the vastness of sounds we experience, no reliable automatic general-purpose audio tagging systems exist.
Currently, a lot of manual effort is required for tasks like annotating sound collections and providing captions for non-speech events in audiovisual content. To tackle this problem, Freesound (an initiative by MTG-UPF that maintains a collaborative database with over 370,000 Creative Commons Licensed sounds) and Google Research's Machine Perception Team (creators of AudioSet, a large-scale dataset of manually annotated audio events with over 500 classes) have teamed up to develop the dataset for this competition. You're challenged to build a general-purpose automatic audio tagging system using a dataset of audio files covering a wide range of real-world environments. Sounds in the dataset include things like musical instruments, human sounds, domestic sounds, and animals from Freesound's library, annotated using a vocabulary of more than 40 labels from Google's AudioSet ontology. To succeed in this competition, your systems will need to be able to recognize an increased number of sound events of very diverse nature, and to leverage subsets of training data featuring annotations of varying reliability (see Data section for more information).`'",,Freesound General-Purpose Audio Tagging Challenge,,Can you automatically recognize sounds from a wide range of real-world environments?,MAP@{K},freesound-general-purpose-audio-tagging-challenge 197,"'`One year ago, Freesound and Google's Machine Perception hosted an audio tagging competition challenging Kagglers to build a general-purpose auto tagging system. This year they're back and taking the challenge to the next level with multi-label audio tagging, a doubled number of audio categories, and a noisier-than-ever training set. If you like raising your ML game, this challenge is for you. Here's the background: Some sounds are distinct and instantly recognizable, like a baby's laugh or the strum of a guitar. Other sounds are difficult to pinpoint.
If you close your eyes, could you tell the difference between the sound of a chainsaw and the sound of a blender? Because of the vastness of sounds we experience, no reliable automatic general-purpose audio tagging systems exist. A significant amount of manual effort goes into tasks like annotating sound collections and providing captions for non-speech events in audiovisual content. To tackle this problem, Freesound (an initiative by MTG-UPF that maintains a collaborative database with over 400,000 Creative Commons Licensed sounds) and Google Research's Machine Perception Team (creators of AudioSet, a large-scale dataset of manually annotated audio events with over 500 classes) have teamed up to develop the dataset for this new competition. To win this competition, Kagglers will develop an algorithm to tag audio data automatically using a diverse vocabulary of 80 categories. If successful, your systems could be used for several applications, ranging from automatic labelling of sound collections to the development of systems that automatically tag video content or recognize sound events happening in real time. Ready to raise your game? Join the competition! Note: this competition is similar in nature to last year's competition, with a new dataset and multi-label annotations. Organizers Eduardo Fonseca, MTG-UPF, Barcelona Manoj Plakal, Google's Sound Understanding, New York Frederic Font, MTG-UPF, Barcelona Dan Ellis, Google's Sound Understanding, New York This is a Kernels-only competition.
Refer to Kernels Requirements for details.`'",,Freesound Audio Tagging 2019,,Automatically recognize sounds and apply tags of varying natures,WeightedLabelRankingAveragePrecision,freesound-audio-tagging-2019 198,"'`Variables description: 1 - State: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ 2 - Account Length: the number of days that this account has been active 3 - Area Code: the three-digit area code of the corresponding customer's phone number 4 - Phone: the remaining seven-digit phone number 5 - Intl Plan: whether the customer has an international calling plan: yes/no 6 - VMail Plan: whether the customer has a voice mail feature: yes/no 7 - VMail Message: presumably the average number of voice mail messages per month 8 - Day Mins: the total number of calling minutes used during the day 9 - Day Calls: the total number of calls placed during the day 10 - Day Charge: the billed cost of daytime calls 11 - Eve Mins, Eve Calls, Eve Charge: the billed cost for calls placed during the evening 12 - Night Mins, Night Calls, Night Charge: the billed cost for calls placed during nighttime 13 - Intl Mins, Intl Calls, Intl Charge: the billed cost for international calls 14 - CustServ Calls: the number of calls placed to Customer Service Output variable (target): 15 - fuga: whether the customer left the service (1: Fugo/ 0: No fugo) Good luck!`'",,Fuga de Clientes,inClass,Predicting which customers are likely to leave the company,auc,fuga-de-clientes 199,"'`This 2018 Fungi Classification is an FGVCx competition as part of the FGVC5 workshop at CVPR 2018. Our sponsor, the Danish Svampe Atlas, has provided a dataset from a carefully curated database containing over 100,000 fungi images.
The Svampe Atlas has a comprehensive representation of nearly 1,500 wild mushroom species, which have been spotted and photographed by the general public in Denmark.`'",,2018 FGCVx Fungi Classification Challenge,inClass,"Fine-grained classification challenge spanning 1,400 species of fungi.",meanbesterroratk,2018-fgcvx-fungi-classification-challenge 200,"'`The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies. RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have. In this competition, you're challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies that choose to use data analysis on top of GA data.`'",,Google Analytics Customer Revenue Prediction,,Predict how much GStore customers will spend,RMSE,google-analytics-customer-revenue-prediction 201,"'`The aim is to create a dashboard representing the daily cases and deaths in the UK. To compete, determine whether each of the UK authorities has passed its peak date yet. An example dashboard is here: http://covid19dj.herokuapp.com/`'",,Python Fun - Covid19 Analysis,inClass,Discover the number of authorities in England that have passed their peak cases,categorizationaccuracy,python-fun-covid19-analysis 202,"'`This competition is closed and no longer accepting submissions. The private leaderboard has been finalized as of 8/28/2019. Important Warning: This competition has an experimental format and submission style (images as submission). 
Competitors must use generative methods to create their submission images and are not permitted to make submissions that include any images already classified as dogs or altered versions of such images. To enforce this and prevent cheating, we reserve the right to: (a) Visually inspect all participants' submitted images, (b) review any submitted source code, (c) use these reviews to identify violators or determine winners, and (d) disqualify participants from the competition who are found in violation. This is also specified in the competition's rules. Use your training skills to create images, rather than identify them. You'll be using GANs, which are at the creative frontier of machine learning. You might think of GANs as robot artists in a sense, able to create eerily lifelike images, and even digital worlds. ""You might not think that programmers are artists, but programming is an extremely creative profession. It's logic-based creativity."" - John Romero A generative adversarial network (GAN) is a class of machine learning system invented by Ian Goodfellow in 2014. Two neural networks compete with each other in a game. Given a training set, this technique learns to generate new data with the same statistics as the training set. In this competition, you'll be training generative models to create images of dogs. Only this time there's no ground truth data for you to predict. Here, you'll submit the images and be scored based on how well those images are classified as dogs by pre-trained neural networks. Take these images, for example. Can you tell which are real vs. generated? Trick question; they are all generated! Why dogs? We chose dogs because, well, who doesn't love looking at photos of adorable pups? Moreover, dogs can be classified into many sub-categories (breed, color, size), making them ideal candidates for image generation. Generative methods (in particular, GANs) are currently used in various places on Kaggle for data augmentation. 
Their potential is vast; they can learn to mimic any distribution of data across any domain: photographs, drawings, music, and prose. If successful, not only will you help advance the state of the art in generative image creation, but you'll enable us to create more experiments across a variety of domains in the future. This is a Kernels-only competition. Refer to Kernels Requirements for details.`'",,Generative Dog Images,,Experiment with creating puppy pics,PostProcessorKernel,generative-dog-images 203,"'`Get out your dowsing rods, electromagnetic sensors, and gradient boosting machines. Kaggle is haunted and we need your help. After a month of making scientific observations and taking careful measurements, we've determined that 900 ghouls, ghosts, and goblins are infesting our halls and frightening our data scientists. When trying garlic, asking politely, and using reverse psychology didn't work, it became clear that machine learning is the only answer to banishing our unwanted guests. So now the hour has come to put the data we've collected in your hands. We've managed to identify 371 of the ghastly creatures, but need your help to vanquish the rest. And only an accurate classification algorithm can thwart them. Use bone length measurements, severity of rot, extent of soullessness, and other characteristics to distinguish (and extinguish) the intruders. Are you ghost-busters up for the challenge?`'",,"Ghouls, Goblins, and Ghosts... Boo!",,Can you classify monsters haunting Kaggle?,categorizationaccuracy,"ghouls,-goblins,-and-ghosts...-boo!" 204,"'`""Quick, Draw!"" was released as an experimental game to educate the public in a playful way about how AI works. The game prompts users to draw an image depicting a certain category, such as banana, table, etc. The game generated more than 1B drawings, of which a subset was publicly released as the basis for this competition's training set. That subset contains 50M drawings encompassing 340 label categories. 
Sounds fun, right? Here's the challenge: since the training data comes from the game itself, drawings can be incomplete or may not match the label. You'll need to build a recognizer that can effectively learn from this noisy data and perform well on a manually-labeled test set from a different distribution. Your task is to build a better classifier for the existing Quick, Draw! dataset. By advancing models on this dataset, Kagglers can improve pattern recognition solutions more broadly. This will have an immediate impact on handwriting recognition and its robust applications in areas including OCR (Optical Character Recognition), ASR (Automatic Speech Recognition) & NLP (Natural Language Processing).`'",,"Quick, Draw! Doodle Recognition Challenge",,How accurately can you identify a doodle?,MAP@{K},"quick,-draw!-doodle-recognition-challenge" 205,"'`An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world. Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers. In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content. Here's your chance to combat online trolls at scale. Help Quora uphold their policy of Be Nice, Be Respectful and continue to be a place for sharing and growing the world's knowledge. 
Important Note: Be aware that this is being run as a Kernels Only Competition, requiring that all submissions be made via a Kernel output. Please read the Kernels FAQ and the data page very carefully to fully understand how this is designed.`'",,Quora Insincere Questions Classification,,Detect toxic content to improve online conversations,FScore_1,quora-insincere-questions-classification 206,"'`Where else but Quora can a physicist help a chef with a math problem and get cooking tips in return? Quora is a place to gain and share knowledge, about anything. It's a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world. Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term. Currently, Quora uses a Random Forest model to identify duplicate questions. In this competition, Kagglers are challenged to tackle this natural language processing problem by applying advanced techniques to classify whether question pairs are duplicates or not. Doing so will make it easier to find high quality answers to questions, resulting in an improved experience for Quora writers, seekers, and readers.`'",,Quora Question Pairs,,Can you identify question pairs that have the same intent?,LogLoss,quora-question-pairs 207,"'`Get started on this competition through Kaggle Scripts. In machine learning, it is often said there are no free lunches. How wrong we were. 
This competition contains a dataset with 5671 textual requests for pizza from the Reddit community Random Acts of Pizza together with their outcome (successful/unsuccessful) and meta-data. Participants must create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness. ""I'll write a poem, sing a song, do a dance, play an instrument, whatever! I just want a pizza,"" says one hopeful poster. What about making an algorithm? Kaggle is hosting this competition for the machine learning community to use for fun and practice. This data was collected and graciously shared by Althoff et al. (Buy them a pizza -- data collection is a thankless and tedious job!) We encourage participants to explore their accompanying paper and ask that you cite the following reference in any publications that result from your work: Tim Althoff, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky. How to Ask for a Favor: A Case Study on the Success of Altruistic Requests, Proceedings of ICWSM, 2014.`'",,Random Acts of Pizza,playground,Predicting altruism through free pizza,AUC,random-acts-of-pizza 208,"'`Serious complications can occur as a result of malpositioned lines and tubes in patients. Doctors and nurses frequently use checklists for placement of lifesaving equipment to ensure they follow protocol in managing patients. Yet, these steps can be time consuming and are still prone to human error, especially in stressful situations when hospitals are at capacity. Hospital patients can have catheters and lines inserted during the course of their admission and serious complications can arise if they are positioned incorrectly. Nasogastric tube malpositioning into the airways has been reported in up to 3% of cases, with up to 40% of these cases demonstrating complications [1-3]. Airway tube malposition in adult patients intubated outside the operating room is seen in up to 25% of cases [4,5]. 
The likelihood of complication is directly related to both the experience level and specialty of the proceduralist. Early recognition of malpositioned tubes is the key to preventing risky complications (even death), even more so now that millions of COVID-19 patients are in need of these tubes and lines. The gold standard for the confirmation of line and tube positions is the chest radiograph. However, a physician or radiologist must manually check these chest x-rays to verify that the lines and tubes are in the optimal position. Not only does this leave room for human error, but delays are also common as radiologists can be busy reporting other scans. Deep learning algorithms may be able to automatically detect malpositioned catheters and lines. Once alerted, clinicians can reposition or remove them to avoid life-threatening complications. The Royal Australian and New Zealand College of Radiologists (RANZCR) is a not-for-profit professional organisation for clinical radiologists and radiation oncologists in Australia, New Zealand, and Singapore. The group is one of many medical organisations around the world (including the NHS) that recognizes malpositioned tubes and lines as preventable. RANZCR is helping design safety systems where such errors will be caught. In this competition, you'll detect the presence and position of catheters and lines on chest x-rays. Use machine learning to train and test your model on 40,000 images to categorize a tube that is poorly placed. The dataset has been labelled with a set of definitions to ensure consistency with labelling. The normal category includes lines that were appropriately positioned and did not require repositioning. The borderline category includes lines that would ideally require some repositioning but would in most cases still function adequately in their current position. The abnormal category includes lines that required immediate repositioning. If successful, your efforts may help clinicians save lives. 
Earlier detection of malpositioned catheters and lines is even more important as COVID-19 cases continue to surge. Many hospitals are at capacity and more patients are in need of these tubes and lines. Quick feedback on catheter and line placement could help clinicians better treat these patients. Beyond COVID-19, detection of line and tube position will ALWAYS be a requirement in many ill hospital patients. This is a Code Competition. Refer to Code Requirements for details. Koopmann MC, Kudsk KA, Szotkowski MJ, Rees SM. A Team-Based Protocol and Electromagnetic Technology Eliminate Feeding Tube Placement Complications [Internet]. Vol. 253, Annals of Surgery. 2011. p. 297-302. Available from: http://dx.doi.org/10.1097/sla.0b013e318208f550 Sorokin R, Gottlieb JE. Enhancing patient safety during feeding-tube insertion: a review of more than 2,000 insertions. JPEN J Parenter Enteral Nutr. 2006 Sep;30(5):440-5. Marderstein EL, Simmons RL, Ochoa JB. Patient safety: effect of institutional protocols on adverse events related to feeding tube placement in the critically ill. J Am Coll Surg. 2004 Jul;199(1):39-47; discussion 47-50. Jemmett ME. Unrecognized Misplacement of Endotracheal Tubes in a Mixed Urban to Rural Emergency Medical Services Setting [Internet]. Vol. 10, Academic Emergency Medicine. 2003. p. 961-5. Available from: http://dx.doi.org/10.1197/s1069-6563(03)00315-4 Lotano R, Gerber D, Aseron C, Santarelli R, Pratter M. Utility of postintubation chest radiographs in the intensive care unit. Crit Care. 2000 Jan 24;4(1):50-3.`'",,RANZCR CLiP - Catheter and Line Position Challenge,,Classify the presence and correct placement of tubes on chest x-rays to save lives,MCAUC,ranzcr-clip-catheter-and-line-position-challenge 209,"'`The aim of this problem is to emulate the temperature time series simulated by a regional climate model. 
Regional Climate Models are numerical tools built to simulate the long-term (hundreds of years) climate of a specific region (Europe) at high resolution (12km here). Because they work only on a limited area, we force them at their boundaries with a Global Climate Model (low resolution but over the whole planet). Global Models do not produce reliable information at the local scale, which is why we use Regional Models to downscale this information. The aim of this competition is to try to predict the temperature series simulated by the regional model given the simulation from the global model. The complete presentation of the problem is available at this link: Intro As mentioned in the presentation, there are 2 different exercises here, corresponding to two different outputs. In the first one we try to predict only the temperature at one grid point, while in the second one we try to predict over a whole region. The submissions for the second one will be done at this link: CODALAB`'",,Statistical Emulators for RCMs,,Reproduce the temperature time series simulated by climate models,rmse,statistical-emulators-for-rcms 210,"'`We might be on the verge of too many screens. It seems like every day, new versions of common objects are re-invented with built-in wifi and bright touchscreens. A promising antidote to our screen addiction is voice interfaces. But, for independent makers and entrepreneurs, it's hard to build a simple speech detector using free, open data and code. Many voice recognition datasets require preprocessing before a neural network model can be built on them. To help with this, TensorFlow recently released the Speech Commands Datasets. It includes 65,000 one-second long utterances of 30 short words, by thousands of different people. In this competition, you're challenged to use the Speech Commands Dataset to build an algorithm that understands simple spoken commands. 
By improving the recognition accuracy of open-sourced voice interface tools, we can improve product effectiveness and their accessibility.`'",,TensorFlow Speech Recognition Challenge,,Can you build an algorithm that understands simple speech commands?,CategorizationAccuracy,tensorflow-speech-recognition-challenge 211,"'`Why is the sky blue? This is a question an open-domain question answering (QA) system should be able to respond to. QA systems emulate how people look for information by reading the web to return answers to common questions. Machine learning can be used to improve the accuracy of these answers. Existing natural language models have been focused on extracting answers from a short paragraph rather than reading an entire page of content for proper context. As a result, the responses can be complicated or lengthy. A good answer will be both succinct and relevant. In this competition, your goal is to predict short and long answer responses to real questions about Wikipedia articles. The dataset is provided by Google's Natural Questions, but contains its own unique private test set. A visualization of examples shows long and, where available, short answers. In addition to prizes for the top teams, there is a special set of awards for using TensorFlow 2.0 APIs. If successful, this challenge will help spur the development of more effective and robust QA systems. About TensorFlow TensorFlow is an open source platform for machine learning. With TensorFlow 2.0, tf.keras is the preferred high-level API for TensorFlow, to make model building easier and more intuitive. You may use the tf.keras built-in compile()/fit() methods, or write your own custom training loops. See the Effective TensorFlow 2.0 guide and the tf.keras guide for more details. 
TensorFlow 2.0 was recently released, and this competition challenges Kagglers to use TensorFlow 2.0's APIs, focused on usability and easier, more intuitive development, to make advancements in question answering.`'",,TensorFlow 2.0 Question Answering,,Identify the answers to real user questions about Wikipedia page content,NQMicroF1,tensorflow-2.0-question-answering 212,"'`As many of us can attest, learning another language is tough. Picking up on nuances like slang, dates and times, and local expressions, can often be a distinguishing factor between proficiency and fluency. This challenge is even more difficult for a machine. Many speech and language applications, including text-to-speech synthesis (TTS) and automatic speech recognition (ASR), require text to be converted from written expressions into appropriate ""spoken"" forms. This is a process known as text normalization, and helps convert 12:47 to ""twelve forty-seven"" and $3.16 into ""three dollars, sixteen cents."" However, one of the biggest challenges when developing a TTS or ASR system for a new language is to develop and test the grammar for all these rules, a task that requires quite a bit of linguistic sophistication and native speaker intuition. For example: ""A baby giraffe is 6ft [six feet] tall and weighs 150lb [one hundred fifty pounds] . [sil]"" In this competition, you are challenged to automate the process of developing text normalization grammars via machine learning. This track will focus on English, while a separate track will focus on Russian here: Russian Text Normalization Challenge About the sponsor Google's Text Normalization Research Group conducts research and creates tools for the detection, normalization and denormalization of non-standard words such as abbreviations, numbers or currency expressions; and semiotic classes -- text tokens and token sequences that represent particular entities that are semantically constrained, such as measure phrases, addresses or dates. 
Applications of this work include text-to-speech synthesis, automatic speech recognition, and information extraction/retrieval.`'",,Text Normalization Challenge - English Language,,Convert English text from written expressions into spoken forms,CategorizationAccuracy,text-normalization-challenge-english-language 213,"'` , Avito . Baseline `'",,Texts classification,inClass,Classify texts from VK into 4 categories,categorizationaccuracy,texts-classification 214,"'`Nearly half of the world depends on seafood for their main source of protein. In the Western and Central Pacific, where 60% of the world's tuna is caught, illegal, unreported, and unregulated fishing practices are threatening marine ecosystems, global seafood supplies and local livelihoods. The Nature Conservancy is working with local, regional and global partners to preserve this fishery for the future. Currently, the Conservancy is looking to the future by using cameras to dramatically scale the monitoring of fishing activities to fill critical science and compliance monitoring data gaps. Although these electronic monitoring systems work well and are ready for wider deployment, the amount of raw data produced is cumbersome and expensive to process manually. The Conservancy is inviting the Kaggle community to develop algorithms to automatically detect and classify species of tunas, sharks and more that fishing boats catch, which will accelerate the video review process. Faster review and more reliable data will enable countries to reallocate human capital to management and enforcement activities, which will have a positive impact on conservation and our planet. Machine learning has the ability to transform what we know about our oceans and how we manage them. You can be part of the solution. 
Resources: You can learn more about this competition and The Nature Conservancy in the video below.`'",,The Nature Conservancy Fisheries Monitoring,,Can you detect and classify species of fish?,MulticlassLoss,the-nature-conservancy-fisheries-monitoring 215,"'`Do you laugh (and then get down to work) in the face of terabytes of noisy, non-stationary data? Winton Capital is looking for data scientists who excel at finding the hidden signal in the proverbial haystack, and who are excited by creating novel statistical modelling and data mining techniques. In this recruiting competition, Winton challenges you to take on the very difficult task of predicting the future (stock returns). Given historical stock performance and a host of masked features, can you predict intra-day and end-of-day returns without being deceived by all the noise? Research scientists at Winton have crafted this competition to be challenging and fun for the community while providing a taste of the types of problems they work on every day. They're excited to connect with Kagglers who bring a unique background and creative approach to the competition. Winton is offering cash prizes to winning teams as a reward for their work, but the intent of the competition is not commercial. The intellectual property you create remains your own and will be evaluated in the context of suitability for employment. For more on the culture at Winton, check out the About Winton page or their careers page.`'",,The Winton Stock Market Challenge,,Join a multi-disciplinary team of research scientists,WMAE,the-winton-stock-market-challenge 216,"'` Ahoy, welcome to Kaggle! You're in the right place. This is the legendary Titanic ML competition: the best first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works. The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. 
Read on or watch the video below to explore more details. Once you're ready to start competing, click on the ""Join Competition"" button to create an account and gain access to the competition data. Then check out Alexis Cook's Titanic Tutorial that walks you through step by step how to make your first submission! The Challenge The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered unsinkable RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren't enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: ""what sorts of people were more likely to survive?"" using passenger data (i.e. name, age, gender, socio-economic class, etc.). Recommended Tutorial We highly recommend Alexis Cook's Titanic Tutorial that walks you through making your very first submission step by step. Overview of How Kaggle's Competitions Work 1. Join the Competition Read about the challenge description, accept the Competition Rules and gain access to the competition dataset. 2. Get to Work Download the data, build models on it locally or on Kaggle Kernels (our no-setup, customizable Jupyter Notebooks environment with free GPUs) and generate a prediction file. 3. Make a Submission Upload your prediction as a submission on Kaggle and receive an accuracy score. 4. Check the Leaderboard See how your model ranks against other Kagglers on our leaderboard. 5. Improve Your Score Check out the discussion forum to find lots of tutorials and insights from other competitors. Kaggle Lingo Video You may run into unfamiliar lingo as you dig into the Kaggle discussion forums and public notebooks. Check out Dr. 
Rachael Tatman's video on Kaggle Lingo to get up to speed! What Data Will I Use in This Competition? In this competition, you'll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled `train.csv` and the other is titled `test.csv`. Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the ground truth. The `test.csv` dataset contains similar information but does not disclose the ground truth for each passenger. It's your job to predict these outcomes. Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived. Check out the Data tab to explore the datasets even further. Once you feel you've created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers. How to Submit your Prediction to Kaggle Once you're ready to make a submission and get on the leaderboard: 1. Click on the Submit Predictions button 2. Upload a CSV file in the submission file format. You're able to submit 10 submissions a day. Submission File Format: You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows. The file should have exactly 2 columns: PassengerId (sorted in any order) Survived (contains your binary predictions: 1 for survived, 0 for deceased) Got it! I'm ready to get started. Where do I get help if I need it? For Competition Help: Titanic Discussion Forum Technical Help: Kaggle Contact Us Page Kaggle doesn't have a dedicated support team so you'll typically find that you receive a response more quickly by asking your question in the appropriate forum. The forums are full of useful information on the data, metric, and different approaches. 
We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn! A Last Word on Kaggle Notebooks As we mentioned before, Kaggle Notebooks is our no-setup, customizable, Jupyter Notebooks environment with free GPUs and a huge repository of community published data & code. In every competition, you'll find many Kernels publicly shared with incredible insights. It's an invaluable resource worth becoming familiar with. Check out this competition's Kernels here. Ready to Compete? Join the Competition Here!`'",,Titanic - Machine Learning from Disaster,,Start here! Predict survival on the Titanic and get familiar with ML basics,categorizationaccuracy,titanic-machine-learning-from-disaster 217,"'`We're going to make you an offer you can't refuse: a Kaggle competition! In a world where movies made an estimated $41.7 billion in 2018, the film industry is more popular than ever. But what movies make the most money at the box office? How much does a director matter? Or the budget? For some movies, it's ""You had me at 'Hello.'"" For others, the trailer falls short of expectations and you think ""What we have here is a failure to communicate."" In this competition, you're presented with metadata on over 7,000 past films from The Movie Database to try and predict their overall worldwide box office revenue. Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. You can collect other publicly available data to use in your model predictions, but in the spirit of this competition, use only data that would have been available before a movie's release. 
Join in, ""make our day"", and then ""you've got to ask yourself one question: 'Do I feel lucky?'""`'",,TMDB Box Office Prediction,,Can you predict a movie's worldwide box office revenue?,RMSLE,tmdb-box-office-prediction 218,"'`Learn how to use Tensor Processing Units (TPUs) on Kaggle. TPUs are powerful hardware accelerators specialized in deep learning tasks. They were developed (and first used) by Google to process large image databases, such as extracting all the text from Street View. This competition is designed for you to give TPUs a try. TPU quotas are available on Kaggle at no cost to users. Watch the video below to see how to get started! You can follow along with this notebook. The Challenge It's difficult to fathom just how vast and diverse our natural world is. There are over 5,000 species of mammals, 10,000 species of birds, 30,000 species of fish and astonishingly, over 400,000 different types of flowers. In this competition, you're challenged to build a machine learning model that identifies the type of flowers in a dataset of images (for simplicity, we're sticking to just over 100 types). Recommended Tutorial We highly recommend Ryan Holbrook's Tutorial that walks you through making your very first submission step by step. Have Questions? Kaggle Data Scientists will be actively monitoring the competition forum - your fellow data scientists and TPU users will be there too! If you have a question or need help troubleshooting, that's the best place to find help. Learn More Check out Kaggle's YouTube playlist for more videos introducing TPUs. Read the TPU documentation for more information and resources. 
Many thanks to Martin Görner, Google Developer Advocate and author of TensorFlow without a PhD, for his tireless work on the dataset, the notebooks, and the original competition that this Getting Started competition draws from.`'",,Petals to the Metal - Flower Classification on TPU,,Getting Started with TPUs on Kaggle!,macrofscore,petals-to-the-metal-flower-classification-on-tpu 219,"'`To explore what our universe is made of, scientists at CERN are colliding protons, essentially recreating mini big bangs, and meticulously observing these collisions with intricate silicon detectors. While orchestrating the collisions and observations is already a massive scientific accomplishment, analyzing the enormous amounts of data produced from the experiments is becoming an overwhelming challenge. Event rates have already reached hundreds of millions of collisions per second, meaning physicists must sift through tens of petabytes of data per year. And, as the resolution of detectors improves, ever better software is needed for real-time pre-processing and filtering of the most promising events, producing even more data. To help address this problem, a team of Machine Learning experts and physicists working at CERN (the world's largest high energy physics laboratory) has partnered with Kaggle and prestigious sponsors to answer the question: can machine learning assist high energy physics in discovering and characterizing new particles? Specifically, in this competition, you're challenged to build an algorithm that quickly reconstructs particle tracks from 3D points left in the silicon detectors. This challenge consists of two phases: The Accuracy phase has run on Kaggle from May to 13th August 2018 (winners to be announced by end of September). Here we'll be focusing on the highest score, irrespective of the evaluation time. This phase is an official IEEE WCCI competition (Rio de Janeiro, Jul 2018). The Throughput phase will run on Codalab starting in September 2018. 
Participants will submit their software, which is evaluated by the platform. The incentive is on the throughput (or speed) of the evaluation while still reaching a good score. This phase is an official NIPS competition (Montreal, Dec 2018). All the necessary information for the Accuracy phase is available here on the Kaggle site. The overall TrackML challenge website is there.`'",,TrackML Particle Tracking Challenge,,High Energy Physics particle tracking in CERN detectors,TrackML,trackml-particle-tracking-challenge 220,"'`In the late 90's, Yann LeCun's team pioneered the successful application of machine learning to optical character recognition. 25 years later, machine learning continues to be an invaluable tool for text processing downstream from the OCR process. Tradeshift has created a dataset with thousands of documents, representing millions of words. In each document, several bounding boxes containing text are selected. For each piece of text, many features are extracted and certain labels are assigned. In this competition, participants are asked to create and open source an algorithm that correctly predicts the probability that a piece of text belongs to a given class.`'",,Tradeshift Text Classification,featured,Classify text blocks in documents,LogLoss,tradeshift-text-classification 221,"'`What does physics have in common with biology, cooking, cryptography, DIY, robotics, and travel? If you answered ""all pursuits are governed by the immutable laws of physics"" we'll begrudgingly give you partial credit. If you answered ""all were chosen randomly by a scheming Kaggle employee for a twisted transfer learning competition"", congratulations, we accept your answer and mark the question as solved. In this competition, we provide the titles, text, and tags of Stack Exchange questions from six different sites. We then ask for tag predictions on unseen physics questions. 
Solving this problem via a standard machine learning approach might involve training an algorithm on a corpus of related text. Here, you are challenged to train on material from outside the field. Can an algorithm learn appropriate physics tags from ""extreme-tourism Antarctica""? Let's find out. Kaggle is hosting this competition for the data science community to use for fun and education. This dataset originates from the Stack Exchange data dump.`'",,Transfer Learning on Stack Exchange Tags,,Predict tags from models trained on unrelated topics,MeanFScore,transfer-learning-on-stack-exchange-tags 222,"'`This competition will launch at midnight UTC on Saturday, December 15. Santa Claus was excited to learn about the Kaggle competition platform, and wanted to use it for a slightly different purpose. Rather than a predictive modeling problem, he has an optimization problem for you: a very, very important optimization problem. Santa needs help choosing the route he takes when delivering presents around the globe. Every year, Santa has to visit every boy and girl on his list. It's a tough challenge, and Santa admits he scored a B- on his combinatorial optimization final. He's hoping you can develop algorithms that will solve his problem year after year. Santa asked that we give you one particular instance of his TSP (Traveling Santa Problem). However, Santa's dilemma isn't quite the same as the Traveling Salesman Problem with which you may be familiar. Santa likes to see new terrain every year--don't ask, it's a reindeer thing--and doesn't want his route to be predictable. You're looking for shortest-distance paths through a set of chimneys, but instead of providing one path, Santa asks you to provide two disjoint paths. If one of your paths contains an edge from A to B, your other path must not contain an edge from A to B or from B to A (either order still counts as using that edge). Your score is the larger of the two distances. 
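This scoring rule can be sketched in a few lines of Python (an illustrative sketch only; the function names and toy coordinates below are not part of the competition):

```python
import math

def path_length(points, path):
    # Total Euclidean length of a path given as a sequence of point indices.
    return sum(math.dist(points[a], points[b]) for a, b in zip(path, path[1:]))

def santa_score(points, path1, path2):
    # Score = the larger of the two path lengths; the paths must not share
    # any edge, in either direction.
    edges1 = {frozenset(e) for e in zip(path1, path1[1:])}
    edges2 = {frozenset(e) for e in zip(path2, path2[1:])}
    if edges1 & edges2:
        raise ValueError("paths share an edge (either direction counts)")
    return max(path_length(points, path1), path_length(points, path2))

# Four toy chimneys on a 3-4-5 grid and two edge-disjoint paths through them
chimneys = [(0, 0), (3, 0), (3, 4), (0, 4)]
print(santa_score(chimneys, [0, 1, 2, 3], [1, 3, 0, 2]))  # -> 14.0
```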
Santa asks competition winners to publish and open source the algorithms they use (for his future use, of course). Rudolph was very adamant about minimizing his workload. Trust us, you don't want to be on Rudolph's bad side. Important note about prizes: We believe that Kaggle's public leaderboard is very important for both the fun of the competition and achieving great results, and we want to provide an incentive for everyone to submit to the public leaderboard all along the way (even though you can easily determine your submission's score all by yourself). So the competition will have two sets of prizes, one based on the scores at the end of the competition, and one based on the scores at the end of a randomly chosen day (UTC) between December 23 and January 17. The day will not be revealed (or even chosen) until after the competition ends. (The competition will end at the end of the day UTC on January 18.) Attributions: Data generation and lots of help framing the problem (including coming up with this TSP variant): Robert Bosch of the Oberlin College Math Department Santa photo: Aurélien S Sleigh photo: Creative Tools Globe: William Cook `'",,Traveling Santa Problem,featured,Solve ye olde traveling salesman problem to help Santa Claus deliver his presents,TravelingSanta,traveling-santa-problem 223,"'`LAUNCHED This competition was launched and opened for submissions on May 27th 2020. Submissions will close in 1 week at 11:00 AM UTC on June 3rd 2020. The public leaderboard is based on the TREC-COVID Round 2 dataset. The private leaderboard will be based on the Round 3 dataset, which will be evaluated after the competition closes. Review the Data page for more details. Researchers, clinicians, and policy makers involved with the response to COVID-19 are constantly searching for reliable information on the virus and its impact. 
This presents a unique opportunity for the information retrieval (IR) and text processing communities to contribute to the response to this pandemic, as well as to study methods for quickly standing up information systems for similar future events. The results of the TREC-COVID Challenge will identify answers for some of today's questions while building infrastructure to improve tomorrow's search systems. Kaggle first teamed up with the Allen Institute for AI in the launch of the COVID-19 Open Research Dataset (CORD-19). TREC-COVID builds on the CORD-19 Challenge by using the same document set, a collection of biomedical literature articles that has been updated on a weekly rolling basis. This is the 3rd Round of the TREC-COVID Challenge. Prior runs were hosted directly on the TREC-COVID Site. For this round, you have the option to submit on Kaggle or directly to the TREC-COVID platform. The organizers have added 5 additional COVID-related topics to the 35 topics from the first two rounds, for a total of 40 topics. You will create a retrieval system that returns ranked lists of documents from CORD-19 for (a) each of these additional Round 3 topics (""runs""), as well as (b) residual rankings on the completed Round 1 & 2 topics, i.e., for any documents not judged in the CORD-19 dataset (not previously included as a ranked document). The eligible population of documents for Round 3 is anything included in the CORD-19 release up to Round 3's launch date, last updated on May 19th 2020. Following the close of Round 3, NIST will gather the collective set of participants' runs, including those participants submitting directly through TREC-COVID. The organizers will then have some reasonable subset of these submissions assessed for relevance by human annotators with biomedical expertise. The results of the human annotation, known as relevance judgments, will then be used to score the submitted runs. 
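The submitted runs are scored with NDCG@K against these judgments. A minimal sketch of the metric (assuming graded gains with the common log2 discount; the organizers' exact convention may differ):

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: graded relevance damped by log2 of rank.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending) ordering.
    best = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / best if best > 0 else 0.0

# Relevance grades of retrieved documents, in ranked order
# (2 = highly relevant, 1 = relevant, 0 = not relevant)
print(round(ndcg_at_k([2, 0, 1, 2], k=4), 4))  # -> 0.8935
```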
It is important to understand that not all documents will be assessed, and thus the private leaderboard score will be based on partial document assessment. With your help, the final document and topic sets together with the cumulative relevance judgments will comprise a COVID test collection. The incremental nature of the collection will support research on search systems for dynamic environments. Acknowledgments The Text REtrieval Conference (TREC) was founded in 1992 to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. The TREC-COVID Challenge is being organized by the Allen Institute for Artificial Intelligence (AI2), the National Institute of Standards and Technology (NIST), the National Library of Medicine (NLM), Oregon Health and Science University (OHSU), and the University of Texas Health Science Center at Houston (UTHealth). See the NIST press release for more information.`'",,TREC-COVID Information Retrieval,,Build a pandemic document retrieval system,NDCG@{K},trec-covid-information-retrieval 224,"'`""My ridiculous dog is amazing."" [sentiment: positive] With all of the tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will boost a company's, or a person's, brand by going viral (positive), or devastate profit because it strikes a negative tone. Capturing sentiment in language is important in these times where decisions and reactions are created and updated in seconds. But, which words actually lead to the sentiment description? In this competition you will need to pick out the part of the tweet (word or phrase) that reflects the sentiment. Help build your skills in this important area with this broad dataset of tweets. Work on your technique to grab a top spot in this competition. What words in tweets support a positive, negative, or neutral sentiment? 
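Submissions in this competition are scored with a word-level Jaccard similarity between the predicted and the actual selected text. A minimal sketch (assuming simple lowercase whitespace tokenization, which may differ in detail from the official evaluation code):

```python
def jaccard(str1, str2):
    # Word-level Jaccard: |intersection| / |union| of the two token sets.
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 1.0  # two empty selections count as a perfect match
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))

print(jaccard("my ridiculous dog is amazing", "ridiculous dog is amazing"))  # -> 0.8
```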
How can you help make that determination using machine learning tools? In this competition we've extracted support phrases from Figure Eight's Data for Everyone platform. The dataset is titled Sentiment Analysis: Emotion in Text tweets with existing sentiment labels, used here under the Creative Commons Attribution 4.0 International licence. Your objective in this competition is to construct a model that can do the same - look at the labeled sentiment for a given tweet and figure out what word or phrase best supports it. Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.`'",,Tweet Sentiment Extraction,,Extract support phrases for sentiment labels,Jaccard,tweet-sentiment-extraction 225,"'`Picture yourself strolling through your local, open-air market... What do you see? What do you smell? What will you make for dinner tonight? If you're in Northern California, you'll be walking past the inevitable bushels of leafy greens, spiked with dark purple kale and the bright pinks and yellows of chard. Across the world in South Korea, mounds of bright red kimchi greet you, while the smell of the sea draws your attention to squids squirming nearby. India's market is perhaps the most colorful, awash in the rich hues and aromas of dozens of spices: turmeric, star anise, poppy seeds, and garam masala as far as the eye can see. Some of our strongest geographic and cultural associations are tied to a region's local foods. This playground competition asks you to predict the category of a dish's cuisine given a list of its ingredients. Acknowledgements We want to thank Yummly for providing this unique dataset. Kaggle is hosting this playground competition for fun and practice.`'",text data,What's Cooking?,playground,Use recipe ingredients to categorize the cuisine,CategorizationAccuracy,whats-cooking? 226,"'`Discovery of the long-awaited Higgs boson was announced on July 4, 2012 and confirmed six months later. 
2013 saw a number of prestigious awards, including a Nobel prize. But for physicists, the discovery of a new particle means the beginning of a long and difficult quest to measure its characteristics and determine if it fits the current model of nature. A key property of any particle is how often it decays into other particles. ATLAS is a particle physics experiment taking place at the Large Hadron Collider at CERN that searches for new particles and processes using head-on collisions of protons of extraordinarily high energy. The ATLAS experiment has recently observed a signal of the Higgs boson decaying into two tau particles, but this decay is a small signal buried in background noise. The goal of the Higgs Boson Machine Learning Challenge is to explore the potential of advanced machine learning methods to improve the discovery significance of the experiment. No knowledge of particle physics is required. Using simulated data with features characterizing events detected by ATLAS, your task is to classify events into ""tau tau decay of a Higgs boson"" versus ""background."" The winning method may eventually be applied to real data and the winners may be invited to CERN to discuss their results with high energy physicists. Acknowledgements This competition is brought to you by Additional support from:`'",tabular data,Higgs Boson Machine Learning Challenge,featured,Use the ATLAS experiment to identify the Higgs boson,HiggsBosonApproximateMedianSignificance,higgs-boson-machine-learning-challenge 227,"'`Planning a celebration is a balancing act of preparing just enough food to go around without being stuck eating the same leftovers for the next week. The key is anticipating how many guests will come. Grupo Bimbo must weigh similar considerations as it strives to meet daily consumer demand for fresh bakery products on the shelves of over 1 million stores along its 45,000 routes across Mexico. 
Currently, daily inventory calculations are performed by direct delivery sales employees who must single-handedly predict the forces of supply, demand, and hunger based on their personal experiences with each store. With some breads carrying a one-week shelf life, the acceptable margin for error is small. In this competition, Grupo Bimbo invites Kagglers to develop a model to accurately forecast inventory demand based on historical sales data. Doing so will make sure consumers of its over 100 bakery products aren't staring at empty shelves, while also reducing the amount spent on refunds to store owners with surplus product unfit for sale.`'",tabular data,Grupo Bimbo Inventory Demand,featured,Maximize sales and minimize returns of bakery goods,RMSLE,grupo-bimbo-inventory-demand 228,"'`Before asking someone on a date or skydiving, it's important to know your likelihood of success. The same goes for quoting home insurance prices to a potential customer. Homesite, a leading provider of homeowners insurance, does not currently have a dynamic conversion rate model that can give them confidence a quoted price will lead to a purchase. Using an anonymized database of information on customer and sales activity, including property and coverage information, Homesite is challenging you to predict which customers will purchase a given quote. Accurately predicting conversion would help Homesite better understand the impact of proposed pricing changes and maintain an ideal portfolio of customer segments. `'",tabular data,Homesite Quote Conversion,featured,Which customers will purchase a quoted insurance plan?,AUC,homesite-quote-conversion 229,"'`Even the bravest patient cringes at the mention of a surgical procedure. Surgery inevitably brings discomfort, and oftentimes involves significant post-surgical pain. Currently, patient pain is frequently managed through the use of narcotics that bring a bevy of unwanted side effects. 
This competition's sponsor is working to improve pain management through the use of indwelling catheters that block or mitigate pain at the source. Pain management catheters reduce dependence on narcotics and speed up patient recovery. Accurately identifying nerve structures in ultrasound images is a critical step in effectively inserting a patient's pain management catheter. In this competition, Kagglers are challenged to build a model that can identify nerve structures in a dataset of ultrasound images of the neck. Doing so would improve catheter placement and contribute to a more pain-free future. `'",image data,Ultrasound Nerve Segmentation,featured,Identify nerve structures in ultrasound images of the neck,Dice,ultrasound-nerve-segmentation 230,"'`One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line. In this recruiting competition, job-seekers are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants must project the sales for each department in each store. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales, but it is challenging to predict which departments are affected and the extent of the impact. Want to work in a great environment with some of the world's largest data sets? This is a chance to display your modeling mettle to the Walmart hiring teams. This competition counts towards rankings & achievements. If you wish to be considered for an interview at Walmart, check the box ""Allow host to contact me"" when you make your first entry. You must compete as an individual in recruiting competitions. 
You may only use the provided data to make your predictions.`'",tabular data,Walmart Recruiting - Store Sales Forecasting,recruitment,Use historical markdown data to predict store sales,WMAE,walmart-recruiting-store-sales-forecasting 231,"'`Description Dataset for practicing classification: use NBA rookie stats to predict whether a player will last 5 years in the league. Practice Skills Machine Learning fundamentals Binary classification Python or R basics`'",tabular data,NBA Rookies,inClass,Predict Career for NBA Rookies,categorizationaccuracy,nba-rookies 232,"'`Overview Color-filtered fundus images visualize the rear of the eye, called the retina (Figure 1). A fundus image provides doctors with a snapshot of the interior of a patient's eye. Based on this type of image, doctors are able to read abnormalities present on the back of the eye, making diagnosis easier and more accurate. Many eye diseases can be found using fundus images, such as diabetic retinopathy, glaucoma, and macular degeneration. Currently available public datasets (EYEPACS, Messidor, etc.), although rich in quantity, only focus on diabetic retinopathy. However, more often than not, a patient can have two or more diseases concurrently. To overcome these two problems, we take other common diseases into consideration and create a multi-labeled dataset. The following describes the dataset in detail. Problem description The dataset includes 3,285 images from CTEH (3,210 abnormal and 75 normal) and 500 normal images from the Messidor and EYEPACS datasets. The abnormalities include: opacity, diabetic retinopathy, glaucoma, macular edema, macular degeneration, and retinal vascular occlusion. In this assignment, we will use 3,435 images for training and predict diseases on 350 unlabeled images. 
Acknowledgements We thank the AI Department of Cao Thang Eye Hospital (CTEH) for providing this dataset`'",image data,VietAI Advance Course - Retinal Disease Detection,inClass,Assignment for VietAI Advance Course 2020 - Deep Learning in Vision,meanfscore,vietai-advance-course-retinal-disease-detection 233,"'`The data come from the IGM, and we will predict whether a municipality's mathematics grade on the ENEM mathematics exam is above or below the Brazilian median. Grades will be calculated according to the following criteria: Highest score in the competition Best exploratory analysis Kernel in the competition Best feature importances analysis Kernel Schedule: 20/05 - Class 06 - Building the Kaggle Model 27/05 - Class 07 - Presentation of the models`'",tabular data,IESB Norte - IGM - Maio 2019,inClass,ENEM mathematics grade prediction,categorizationaccuracy,iesb-norte-igm-maio-2019 234,"'`The data come from the IGM, and we will predict whether a municipality's mathematics grade on the ENEM mathematics exam is above or below the Brazilian median (0 below the median, 1 above). Grades will be calculated according to the following criteria: Highest score in the competition Best exploratory analysis Kernel in the competition Best feature importances analysis Kernel The best work in each of the 3 criteria above will be awarded a R$50 Uber Eats voucher to spend during the last class of the course. Schedule: 21/05 - Class 06 - Building the Kaggle Model 28/05 - Class 07 - Finalizing the Kaggle Model (presentation of the best work)`'",tabular data,IESB Sul - IGM - Maio 2019,inClass,ENEM mathematics grade prediction,categorizationaccuracy,iesb-sul-igm-maio-2019 235,"'`In advance of the March 4, 2019 Global WiDS Conference, the Global WiDS team, the West Big Data Innovation Hub, and the WiDS Datathon Committee have been working with Planet and Figure Eight to bring a dataset of high-resolution satellite imagery to participants, building awareness about deforestation and oil palm plantations. 
We invite you to build a team, hone your data science skills, and join us in this predictive analytics challenge focused on social impact. Keep reading to learn more about the datathon, the significance of oil palm, and how to get started. UPDATE: The Sign Up and Merger Deadline is now Feb. 24 11:59PM UTC, but please note Kaggle has scheduled maintenance Feb. 22 5PM UTC - Feb. 23 5AM UTC. Also, the official WiDS 2019 Datathon Participant Form and opportunity to be eligible for additional prizes is now up at http://bit.ly/WiDSdatathon2019form Why oil palm? Deforestation through oil palm plantation growth represents an agricultural trend with large economic and environmental impacts. From shampoo to donuts and ice cream, oil palm is present in many everyday products, but many have never heard of it explicitly! Because oil palm grows only in tropical environments, the crop's expansion has led to deforestation, increased carbon emissions, and biodiversity loss, while at the same time providing many valuable jobs. With the economic livelihoods of millions and the ecosystems of the tropics at stake, how might we work towards affordable, timely, and scalable ways to address the expansion and management of oil palm throughout the world? High-resolution satellite imagery is a global, regularly-updated, and accurate source of data. Coupled with computer vision algorithms, it presents a promising opportunity for automated mapping of oil palm plantations, an important step toward understanding global impact. Who can participate We invite everyone, from those new to data science to veterans of the field, to participate. For those who have never tried machine learning or worked with satellite data before, we will be releasing a series of guides to help you get started with the algorithms and dataset. Get the latest resources by following #WiDSDatathon on social media and visiting widsconference.org/datathon. 
The WiDS Datathon aims to inspire women worldwide to learn more about data science, and to create a supportive environment for women to connect with others in their community who share their interests. Toward these ends, we open the datathon to individuals or teams of up to 4; at least half of each team must be women (individuals identifying as female). Participants can be students, faculty, government workers, members of NGOs, or industry members. More details The challenge is to create a model that predicts the presence of oil palm plantations in satellite imagery. Planet and Figure Eight have generously provided an annotated dataset of satellite images recently taken by Planet satellites. The dataset images are 3-meter spatial resolution, and each is labeled with whether an oil palm plantation appears in the image (0 for no plantation, 1 for any presence of a plantation). The datathon task is to train a model that takes as input a satellite image and outputs a prediction of how likely it is that the image contains an oil palm plantation. Labeled training and test datasets are provided for model development; you will then upload your predictions for an unlabeled test set to Kaggle, and these predictions will be used to determine the public leaderboard rankings and the final winners of the competition. Data analysis can be done using your preferred tools. The winners will be determined by the leaderboard on the Kaggle platform at the time the contest closes on February 27. For more details and answers to frequently asked questions, please visit our FAQ page. Acknowledgements The WiDS Datathon 2019 is a collaboration led by the Global WiDS team at Stanford, the West Big Data Innovation Hub, and the WiDS Datathon Committee. 
Special thanks to data providers Planet and Figure Eight, as well as our growing community of sponsors and supporters.`'",image data,WiDS Datathon 2019,inClass,Join the Women in Data Science (WiDS) Datathon 2019,auc,wids-datathon-2019 236,"'`Objective Apply the concepts covered in the Machine Learning lab sessions of the Artificial Intelligence course to the solution of a practical application, in the setting of a competition hosted on the Kaggle platform, in which each student will try to obtain the best score.`'",tabular data,IA1819,inClass,Deliverable 2 of the AI course 18/19,f_{beta},ia1819 237,"'`Camera Traps (or Wild Cams) enable the automatic collection of large quantities of image data. Biologists all over the world use camera traps to monitor biodiversity and population density of animal species. We have recently been making strides towards automating the species classification challenge in camera traps, but as we try to expand the scope of these models from specific regions where we have collected training data to nearby areas we are faced with an interesting problem: how do you classify a species in a new region that you may not have seen in previous training data? In order to tackle this problem, we have prepared a challenge where the training data and test data are from different regions, namely the American Southwest and the American Northwest. The species seen in each region overlap, but are not identical, and the challenge is to classify the test species correctly. To this end, we will allow training on our American Southwest data (from CaltechCameraTraps), on iNaturalist 2017/2018 data, and on simulated data generated from Microsoft AirSim. We have provided a taxonomy file mapping our classes into the iNat taxonomy. This is an FGVCx competition as part of the FGVC6 workshop at CVPR 2019, and is sponsored by Microsoft AI for Earth. There is a GitHub page for the competition here. 
Please open an issue if you have questions or problems with the dataset. If you use this dataset in publication, please cite: @article{beery2019iwildcam, title={The iWildCam 2019 Challenge Dataset}, author={Beery, Sara and Morris, Dan and Perona, Pietro}, journal={arXiv preprint arXiv:1907.07617}, year={2019} } Kaggle is excited to partner with research groups to push forward the frontier of machine learning. Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, quality, or topic will be addressed by them.`'",image data,iWildCam 2019 - FGVC6,playground,Categorize animals in the wild,MacroFScore,iwildcam-2019-fgvc6 238,"'`Briefing Today you need to predict the house prices in the USA. You can only use linear models to solve this challenge. Use of raw features is not the best idea, but you may create your own ones. Be creative and remember what you were taught!`'",tabular data,House pricing,inClass,House sales in USA,mape,house-pricing 239,"'`Data analytics capabilities enable us to analyze data with greater depth, sophistication, and efficiency through innovations such as artificial intelligence, machine learning, natural language processing, and bots. Indian Society for Technical Education Student's Chapter NIT Hamirpur has launched DATATHON, which invites participants from NIT Hamirpur, who will join us in our quest to develop models for the appropriate sourcing and usage of data across businesses. Participants are required to implement OHLCV time series analysis on the given dataset. The prizes will be given upon producing an identity card of NIT Hamirpur. What is OHLCV? The columns of the dataset are: O: Open Value H: High Value L: Low Value C: Close Value V: Volume Sell To read more about this type of dataset, refer here. 
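As a small illustration of the OHLCV structure (the candle values below are made up), consecutive bars can be aggregated into one coarser bar: first open, highest high, lowest low, last close, summed volume:

```python
def merge_candles(candles):
    # Aggregate consecutive OHLCV candles into a single coarser candle.
    opens, highs, lows, closes, volumes = zip(*candles)
    return (opens[0], max(highs), min(lows), closes[-1], sum(volumes))

# Two made-up one-minute bars: (open, high, low, close, volume)
minute_bars = [
    (100.0, 102.0, 99.5, 101.0, 1500),
    (101.0, 103.5, 100.8, 103.0, 2200),
]
print(merge_candles(minute_bars))  # -> (100.0, 103.5, 99.5, 103.0, 3700)
```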
For reference and clear understanding, sample code is provided here.`'",tabular data,Team ISTE's Datathon,inClass,First-ever Data Science Competition Held at NIT Hamirpur,rmse,team-istes-datathon 240,"'`Each example in this classification task is a movie review. The goal is to predict whether the review is a positive or a negative one. The data used for this task is based on the Large Movie Review Dataset v1.0[1]. Each review in this task is characterized by the histogram of the words it contains. You are also provided with the raw text, which can be used to try out new features. Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).`'",text data,Classifying Movie Reviews,inClass,Rating movies for fun and profit,categorizationaccuracy,classifying-movie-reviews 241,"'`This competition focuses on the problem of forecasting the future values of multiple time series, as it has always been one of the most challenging problems in the field. More specifically, we aim the competition at testing state-of-the-art methods designed by the participants, on the problem of forecasting future web traffic for approximately 145,000 Wikipedia articles. Sequential or temporal observations emerge in many key real-world problems, ranging from biological data, financial markets, weather forecasting, to audio and video processing. The field of time series encapsulates many different problems, ranging from analysis and inference to classification and forecast. What can you do to help predict future views? This competition will run as two stages and involves prediction of actual future events. There will be a training stage during which the leaderboard is based on historical data, followed by a stage where participants are scored on real future events. You have complete freedom in how to produce your forecasts: e.g. 
use of univariate vs multivariate models, use of metadata (article identifier), hierarchical time series modeling (for different types of traffic), data augmentation (e.g. using Google Trends data to extend the dataset), anomaly and outlier detection and cleaning, different strategies for missing value imputation, and many more types of approaches. We thank Google Inc. and Voleon for sponsorship of this competition, and Oren Anava and Vitaly Kuznetsov for organizing it. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, quality, or topic will be addressed by them.`'",tabular data,Web Traffic Time Series Forecasting,research,Forecast future traffic to Wikipedia pages,SMAPE,web-traffic-time-series-forecasting 242,"'`Virtual Hackathon Participate in a virtual hackathon for scholars of the Secure and Private AI Scholarship Challenge from Facebook, conducted by #sghackathonorgnizrs. Come join us for a fun-filled weekend of coding and competing against each other. When is it? Hackathon starts => Saturday 00:01am GMT to Monday 11:59am GMT Coding Time => Saturday 00:01am GMT to Sunday 11:59pm GMT. Committing Kernel => Sunday 00:01am GMT to Monday 11:59am GMT How to participate? Use this form to sign up. You can participate alone or as part of a team of up to 4 individuals. Only 1 member of the team needs to fill the form. https://forms.gle/EXVwAntevyexEqYP8. Please join the #sg_hackathon-orgnizrs channel to get the announcements and ask questions. When will results be announced? 
On Wednesday. Acknowledgements We thank Udacity and Facebook for this opportunity. For more FAQs, please go to our GitHub page`'",text data,Hackathon Sentimento,inClass,Sentiment Analysis,categorizationaccuracy,hackathon-sentimento 243,"'`Virtual Hackathon Participate in a virtual hackathon for scholars of the Secure and Private AI Scholarship Challenge from Facebook, conducted by #sghackathonorgnizrs. Come join us for a fun-filled weekend of coding and competing against each other. When is it? Hackathon starts => Saturday 00:01am GMT to Monday 11:59am GMT Coding Time => Saturday 00:01am GMT to Sunday 11:59pm GMT. Committing Kernel => Sunday 00:01am GMT to Monday 11:59am GMT How to participate? Use this form to sign up. You can participate alone or as part of a team of up to 4 individuals. Only 1 member of the team needs to fill the form. http://bit.ly/hackathon-signup. Please join the #sg_hackathon-orgnizrs channel to get the announcements and ask questions. When will results be announced? On Wednesday. Acknowledgements We thank Udacity and Facebook for this opportunity. For more FAQs, please go to our GitHub page`'",image data,Hackathon Auto_matic,inClass,Cars Dataset,categorizationaccuracy,hackathon-auto_matic 244,'`-`',tabular data,VSU ML 1 Regression,inClass,Learning to solve a regression problem and do EDA,rmse,vsu-ml-1-regression 245,"'`This challenge serves as the first-round competition for hackStat 2.0. In this competition you will work with a challenging dataset consisting of data on visitors to a website. Predict whether a customer will be a revenue-generating customer or not, using the revenue variable as the dependent variable. The rest of the variables serve as independent variables. Upload the predicted outcome for the test set, according to the format provided, to obtain the accuracy of the prediction. 
Please find the details of the data and the competition in the email you received after registration. By solving this competition you will be able to apply and enhance your data science skills. Good luck!`'",tabular data,hackStat 2.0,inClass,hackStat 2.0 - First-round competition,categorizationaccuracy,hackstat-2.0 246,"'`Camera Traps (or Wild Cams) enable the automatic collection of large quantities of image data. Biologists all over the world use camera traps to monitor biodiversity and population density of animal species. We have recently been making strides towards automatic species classification in camera trap images. However, as we try to expand the scope of these models we are faced with an interesting problem: how do we train models that perform well on new (unseen during training) camera trap locations? Can we leverage data from other modalities, such as citizen science data and remote sensing data? In order to tackle this problem, we have prepared a challenge where the training data and test data are from different cameras spread across the globe. The sets of species seen in each camera overlap, but are not identical. The challenge is to classify species in the test cameras correctly. To explore multimodal solutions, we allow competitors to train on the following data: (i) our camera trap training set (data provided by WCS), (ii) iNaturalist 2017-2019 data, and (iii) multispectral imagery (from Landsat 8) for each of the camera trap locations. On the competition GitHub page we provide the multispectral data, a taxonomy file mapping our classes into the iNat taxonomy, a subset of iNat data mapped into our class set, and a camera trap detection model (the MegaDetector) along with the corresponding detections. 
If you use this dataset in publication, please cite: @article{beery2020iwildcam, title={The iWildCam 2020 Competition Dataset}, author={Beery, Sara and Cole, Elijah and Gjoka, Arvi}, journal={arXiv preprint arXiv:2004.10340}, year={2020} } This is an FGVCx competition as part of the FGVC7 workshop at CVPR 2020, and is sponsored by Microsoft AI for Earth and Wildlife Insights. There is a GitHub page for the competition here. Please open an issue if you have questions or problems with the dataset. You can find the iWildCam 2018 Competition here, and the iWildCam 2019 Competition here. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, quality, or topic will be addressed by them.`'",image data,iWildCam 2020 - FGVC7,research,Categorize animals in the wild,CategorizationAccuracy,iwildcam-2020-fgvc7 247,"'`UI Data Science Summer School 2019 This page will serve as the reference for team presentations and the place for participants to submit their work.`'",tabular data,UI DS Summer School,inClass,UI Data Science Summer School,rmse,ui-ds-summer-school 248,"'`This is an approximate replica of the 1993 energy use prediction competition run by ASHRAE. Experience historic data formats and file sizes without needing to send five-inch floppy disks in the mail! We're hosting this as a for-fun companion to the current featured ASHRAE competition. Special thanks to Jeff Harberl of Texas A&M for sharing the data and for an impressive act of digital record-keeping! Banner photo by Karsten Würth (@karsten.wuerth) on Unsplash`'",tabular data,Great Energy Predictor Shootout I,inClass,Replica of the original 1993 competition,rmse,great-energy-predictor-shootout-i 249,"'`Can an automated system recommend a funny joke? 
Content The dataset contains over 1.7 million continuous ratings (real values ranging from -10.00 to +10.00) of 150 jokes from 59,132 users.`'",text data,Recommender Systems,inClass,Let's Compete to Predict a User's Favorite Joke!,categorizationaccuracy,recommender-systems 250,"'` multiclass metric : f1 micro NLP 7/719:00 7/1419:00 7/1419:00 private:public = 50:50 trainpublic, private submit20 2 word embedding `'",tabular data,YKC-2nd,inClass,YJKC-2nd,meanfscore,ykc-2nd 251,"'`Predict Friends or Not. In this machine learning challenge you are to predict whether two persons are friends or not. You are given data of recorded events from a group of people, and using the given data you have to predict whether they are friends or not. The personal information has been masked for obvious reasons. Safe coding!`'",tabular data,Who is a Friend?,inClass,Predict whether two persons are friends or not based on their meetings.,categorizationaccuracy,who-is-a-friend? 252,"'`Blinding (or masking) is the process used in experimental research by which study participants, persons caring for the participants, persons providing the intervention, data collectors and data analysts are kept unaware of group assignment (control vs intervention). Blinding aims to reduce the risk of bias that can be caused by an awareness of group assignment. With blinding, outcomes can be attributed to the intervention itself and not influenced by behaviour or assessment of outcomes that can result purely from knowledge of group allocation. Blinding of intervention: the medical treatment method is unknown to the experimenters. Blinding of outcome assessment: the outcome assessment method is unknown to the experimenters. This competition aims to find a robust NLP technique for correctly classifying the label classes of the Blinding of intervention and the Blinding of outcome assessment. 
Input: Content of journal papers Output: The result class, which can be one of PP, PN, PQ, NP, NN, NQ, QP, QN, or QQ: the concatenation of the Blinding of intervention's label class and the Blinding of outcome assessment's label class. Label classes of the Blinding of intervention and the Blinding of outcome assessment are defined as follows. P = Positive, the blinding is performed in the experiment. N = Negative, the blinding is NOT performed in the experiment. Q = Question, the blinding is not stated in the paper.`'",text data,HTA Tagging,inClass,Classify medical academic papers based on biases.,categorizationaccuracy,hta-tagging 253,'`store log()`',tabular data,YKC-cup-1st,inClass,YKC-cup-1st,rmse,ykc-cup-1st 254,"'`How can we use the world's tools and intelligence to forecast economic outcomes that can never be entirely predictable? This question is at the core of countless economic activities around the world, including at Two Sigma Investments, which has been applying technology and systematic strategies to financial trading since 2001. For over 15 years, Two Sigma has been at the forefront of applying technology and data science to financial forecasts. While their pioneering advances in big data, AI, and machine learning in the financial world have been pushing the industry forward, as with all other scientific progress, they are driven to make continual progress. Through this exclusive partnership, Two Sigma is excited to explore what untapped value Kaggle's diverse data science community can discover in the financial markets. Economic opportunity depends on the ability to deliver singularly accurate forecasts in a world of uncertainty. By accurately predicting financial movements, Kagglers will learn about scientifically-driven approaches to unlocking significant predictive capability. Two Sigma is excited to find predictive value and gain a better understanding of the skills offered by the global data science crowd. What is a Code Competition? 
Welcome to Kaggle's very first Code Competition! In contrast to our traditional competitions, where competitors submit only prediction outputs, participants in Code Competitions will submit their code via Kaggle Kernels. All kernels are private by default in Code Competitions. You can build your models in Kernels by running them on a training set and, once you're ready to submit your code, your model's performance will be evaluated against the test set and your score and public leaderboard position revealed. As with our traditional competitions, we still maintain a private leaderboard test set, which your code is also evaluated against for final scoring, but is not revealed until the competition closes. Since Code Competitions are brand new, we ask for your patience if you encounter bugs or frustrating platform quirks. Please report any issues you find in the forums and we'll do our best to respond. Who owns my code? You do. Even though you are submitting code, the intellectual property exchange here works similarly to a standard prediction competition, whereby prize winners have the option to grant a non-exclusive license in exchange for a prize. There is a new addition to the terms for Code Competitions: Kaggle and the competition host reserve a right to review submissions ""for purposes related to evaluation and scoring in this Competition, including but not limited to the assessment of potential cheating behavior."" Please refer to the official competition rules for full details. Getting Started Review the data page for details about the data and the evaluation metric. You may download the train set for local training. Take a look at the tutorial covering the new code submission process under the submission instructions tab. You'll find step-by-step instructions, some helpful pointers, plus details on environment constraints. Get feedback on your benchmark code and share exploratory analyses with the community by making any of your kernels public. 
Improve your score! Note: there is no cost of entry for participation.`'",,Two Sigma Financial Modeling Challenge,,Can you uncover predictive value in an uncertain world?,RValue,two-sigma-financial-modeling-challenge 255,"'`August 2019 Update: this competition is closed and is no longer accepting submissions. The data has been removed from this competition and is not available for use. Thanks for participating! Can we use the content of news analytics to predict stock price performance? The ubiquity of data today enables investors at any scale to make better investment decisions. The challenge is ingesting and interpreting the data to determine which data is useful, finding the signal in this sea of information. Two Sigma is passionate about this challenge and is excited to share it with the Kaggle community. As a scientifically driven investment manager, Two Sigma has been applying technology and data science to financial forecasts for over 17 years. Their pioneering advances in big data, AI, and machine learning have pushed the investment industry forward. Now, they're eager to engage with Kagglers in this continuing pursuit of innovation. By analyzing news data to predict stock prices, Kagglers have a unique opportunity to advance the state of research in understanding the predictive power of the news. This power, if harnessed, could help predict financial outcomes and generate significant economic impact all over the world. Data for this competition comes from the following sources: Market data provided by Intrinio. News data provided by Thomson Reuters. Copyright Thomson Reuters, 2017. All Rights Reserved. Use, duplication, or sale of this service, or data contained herein, except as described in the Competition Rules, is strictly prohibited. 
The THOMSON REUTERS Kinesis Logo and THOMSON REUTERS are trademarks of Thomson Reuters and its affiliated companies in the United States and other countries and used herein under license.`'",,Two Sigma: Using News to Predict Stock Movements,,Use news analytics to predict stock price performance,TwoSigmaNews,two-sigma:-using-news-to-predict-stock-movements 256,'`Can you help prevent diabetes before people get it?`',,Diabetes,inClass,Figure out who will get diabetes!,categorizationaccuracy,diabetes 257,"'`Equifax really needs help, can you create a model that can find the needle in a haystack?`'",,Credit Card Fraud,inClass,Can you help Equifax find compromised information?,categorizationaccuracy,credit-card-fraud 258,'`Please view the associated notebook.`',,Who is rich?,inClass,Predict household's income using regression!,categorizationaccuracy,who-is-rich? 259,'`See the associated notebook for details.`',,Credit Card Fraud SP20,inClass,Who's committing fraud? Surprisingly it's hard to spot...,categorizationaccuracy,credit-card-fraud-sp20 260,'`This is the home page of the UGent machine-learning-based NLP sentiment analysis task.`',,Lab 1 - Sentiment Analysis,inClass,Machine-learning based natural language processing,categorizationaccuracy,lab-1-sentiment-analysis 261,"'`The third assignment this semester is to denoise small images. As training data you receive pairs (noisy_image, clean_image); in the test data you have only the noisy input. The standard approach to this type of problem is an autoencoder: noisy images are given as input, and clean images are produced as output. In a standard autoencoder the reconstruction loss measured the distance between the original image (the same one fed to the input) and its reconstruction. Here we will instead measure the distance between the reconstruction and the noise-free image. Informally, the solution can be formulated as follows. Denote the original noise-free images as X and the noisy image as φ(X). 
The cost function in this case will be |D(E(φ(X))) − X|², where E is the encoding network and D is the decoding network. The task will mainly consist of choosing the networks E and D and selecting the size of the latent space so that all the image information needed for reconstruction is preserved. This should be entirely sufficient to beat the baseline. Further improvements may involve modifying the cost function and data augmentation.`'",,UJ SN2019 Zadanie 3: Odszumianie,inClass,Assignment 3 for the SN2019 course,rmse,uj-sn2019-zadanie-3:-odszumianie 262,"'`Climate change has been at the top of our minds and on the forefront of important political decision-making for many years. We hope you can use this competition's dataset to help demystify an important climatic variable. Scientists, like those at the Max Planck Institute for Meteorology, are leading the charge with new research on the world's ever-changing atmosphere, and they need your help to better understand the clouds. Shallow clouds play a huge role in determining the Earth's climate. They're also difficult to understand and to represent in climate models. By classifying different types of cloud organization, researchers at Max Planck hope to improve our physical understanding of these clouds, which in turn will help us build better climate models. There are many ways in which clouds can organize, but the boundaries between different forms of organization are murky. This makes it challenging to build traditional rule-based algorithms to separate cloud features. The human eye, however, is really good at detecting features, such as clouds that resemble flowers. In this challenge, you will build a model to classify cloud organization patterns from satellite images. If successful, you'll help scientists to better understand how clouds will shape our future climate. This research will guide the development of next-generation models which could reduce uncertainties in climate projections. Help us remove the haze from climate models and bring clarity to cloud identification. 
For more information on the scientific background and how the labels were created see the following paper.`'",,Understanding Clouds from Satellite Images,,Can you classify cloud structures from satellites? ,Dice,understanding-clouds-from-satellite-images 263,"'`Around the world, the pool of funds available for research grants is steadily shrinking (in a relative sense). In Australia, success rates have fallen to 20-25 per cent, meaning that most academics are spending valuable time making applications that end up being rejected. With this problem in mind, the University of Melbourne is hosting a competition to predict the success of grant applications. The winning model will be used by the university to predict which grant applications are likely to be successful, so that less time is wasted on applications that are unlikely to succeed. The university hopes the competition will also shed some light on what factors are important in determining whether an application will succeed. The university has provided a dataset containing 249 features, including variables that represent the size of the grant, the general area of study and de-identified information on the investigators who are applying for the grant. Participants train their models on 8,707 grant applications made between 2004 and 2008. They then make predictions on a further 2,176 applications made in 2009 and the first half of 2010. The winner of this competition will receive US$5,000. To be eligible for the prize, the winning method must be implementable by the University of Melbourne. `'",,Predict Grant Applications,featured,This task requires participants to predict the outcome of grant applications for the University of Melbourne. ,AUC,predict-grant-applications 264,"'`Overview In this competition you are tasked with predicting the CPU performance for a set of hardware configurations. 
Each observation consists of 6 explanatory variables (or features), such as memory size, cache size, and number of memory channels. You are provided with 3 files: X_train.csv: the explanatory variables for the training data. Each row corresponds to an observation and each column corresponds to a feature. y_train.csv: the ground truth for the training data. The first column is the enumeration of the observations, while the second column is the CPU performance. Hence the second column of the nth row denotes the CPU performance for the nth observation in X_train.csv. X_test.csv: the explanatory variables for the testing data. The structure is the same as for X_train.csv. Your job is to train a model using X_train.csv and y_train.csv and make a prediction for the observations in X_test.csv. Making submissions When you upload your prediction for X_test.csv, the answer is compared to the true values (to which students have no access). You then receive a public and a private score based on the accuracy of the prediction. The public score goes on the Leaderboard, which tracks the current ranking of the groups. The private score will be kept hidden until the end of the competition, when it will be used in determining the final ranking. Your public score thus serves only as an indication of how well you're doing. Each group may submit two answers each day. If an answer yields a better public score than what is currently on the leaderboard for that group, the score is updated. By the end of the competition each group is asked to choose two submissions for the final ranking. The best of these will serve as the group's final score. 
Refer to the example script on the Data page for a starting point in loading and saving the data.`'",,"UoG-ML-1819, regression",inClass,"Fit your model for the training set, make predictions for the test set, and upload your results in a CSV file.",rmse,"uog-ml-1819,-regression" 265,"'`This competition aims to regress the value of used cars using some simple factors affecting their value. Description of data The data consists of 70,000 training rows and 30,000 test rows. Each row has an id field. The rating factors in the X datasets are as follows: brand (categorical) year (numeric) age (numeric) engine_size (numeric) power (numeric) mileage (numeric) prev_owners (numeric) The y_train dataset contains the corresponding valuations for each id in the X_train dataset. These can be used to train a model to predict value. The aim is to predict the value of the cars represented by the rows in X_test. Submissions should consist of a two-column CSV with id and value as the columns.`'",,Used car valuation,inClass,Regression of used car value using commonly available dimensions,mae,used-car-valuation 266,"'`Introduction Understanding how the brain gives rise to intelligence is undeniably one of the greatest challenges facing scientists today. One aspect of intelligence is how the brain encodes sensory inputs (e.g. patterns of light which reach the eye) into a useful representation (e.g. objects with properties like a category). The purpose of this challenge is to see how well this representation can be functionally approximated using any available method. This provides a benchmark for physiological models and a virtual model of the neuron which can be investigated in silico by neuroscientists seeking intuition into these neurons. In addition, predictive models can be experimentally useful in quickly determining the optimal stimuli for investigating complex sensory neurons. To get started quickly go to the 'Kernels' tab and look at the UW Neural Data Challenge notebook. 
Background The data come from the Pasupathy lab (Department of Biological Structure, University of Washington, Seattle), a primate electrophysiology lab studying how visual input is processed by the brain: specifically, how single neurons support primates' rich representation of the visual world. Neurons are specialized cells which generate rapid changes in voltage across their membrane (called spikes). This voltage change causes the release of neurotransmitter onto other neurons, which increases or decreases the likelihood that those neurons produce spikes. The Pasupathy lab tries to understand the relationship between the images a monkey is observing and the likelihood of a neuron generating a spike in response to that image. This is done by placing electrodes in the brain and carefully recording the number of spikes a neuron produces in response to images. Neurons in different brain areas are specialized for different levels of stimulus complexity, and neurons within an area respond to very specific stimuli (for instance, faces). The Pasupathy lab's focus is on a series of regions in the brain which are thought to be integral in the processing of the category of visual objects (the ventral stream), with the main focus on an area of the brain called 'V4' (https://en.wikipedia.org/wiki/Visual_cortex#V4). This area is an intermediate area in the ventral stream; earlier areas of the ventral stream seem to represent quite simple visual properties (such as edges) and later areas represent more abstract categories (like faces), whereas V4 is thought to be an intermediate representation transforming the simpler features in earlier areas into the complex patterns found in later areas. 
Acknowledgements We thank the Pasupathy lab for providing this dataset and the Center for Computational Neuroscience UW for providing funding.`'",,UW Neural Data Challenge,,Predict the responses of visual neurons to images.,rmse,uw-neural-data-challenge 267,"'`Eye state detection is the task of predicting the state of the eye: whether it is open or closed. To achieve this task, many researchers have investigated a new trend of using brain activity signals, measured by means of electroencephalography (EEG), for the training and testing of various machine learning classification algorithms. The task of predicting human actions via brain signals has high importance and usability in various fields such as computer games, health care and biomedical systems, emotion tracking, smart home device control and the internet of things, the military, and detection of car driving drowsiness. An Emotiv headset device with 14 sensors was used to record brain signals. The duration of each recording was 117 seconds. Then, the different eye states observed during each recording were manually added. Each data point consists of 14 EEG features and an eye-state class (either 0 for open, or 1 for closed). The dataset was created by Rösler and Suendermann, who first used it in ""O. Rösler and D. Suendermann, A first step towards eye state prediction using EEG, Proc. of the AIHLS, 2013""`'",,Eye Blinking Prediction,inClass,CompOmics 2018 summer competition,auc,eye-blinking-prediction 268,"'`When you have a broken arm, radiologists help save the day, and the bone. These doctors diagnose and treat medical conditions using imaging techniques like CT and PET scans, MRIs, and, of course, X-rays. Yet, as it happens when working with such a wide variety of medical tools, radiologists face many daily challenges, perhaps the most difficult being the chest radiograph. The interpretation of chest X-rays can lead to medical misdiagnosis, even for the best practicing doctor. 
Computer-aided detection and diagnosis systems (CADe/CADx) would help reduce the pressure on doctors at metropolitan hospitals and improve diagnostic quality in rural areas. Existing methods of interpreting chest X-ray images classify them into a list of findings. There is currently no specification of their locations on the image, which sometimes leads to inexplicable results. A solution for localizing findings on chest X-ray images is needed for providing doctors with more meaningful diagnostic assistance. Established in August 2018 and funded by the Vingroup JSC, the Vingroup Big Data Institute (VinBigData) aims to promote fundamental research and investigate novel and highly-applicable technologies. The Institute focuses on key fields of data science and artificial intelligence: computational biomedicine, natural language processing, computer vision, and medical image processing. The medical imaging team at VinBigData conducts research in collecting, processing, analyzing, and understanding medical data. They're working to build large-scale and high-precision medical imaging solutions based on the latest advancements in artificial intelligence to facilitate effective clinical workflows. In this competition, you'll automatically localize and classify 14 types of thoracic abnormalities from chest radiographs. You'll work with a dataset consisting of 18,000 scans that have been annotated by experienced radiologists. You can train your model with 15,000 independently-labeled images and will be evaluated on a test set of 3,000 images. These annotations were collected via VinBigData's web-based platform, VinLab. Details on building the dataset can be found in our recent paper VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations. If successful, you'll help build what could be a valuable second opinion for radiologists. 
An automated system that could accurately identify and localize findings on chest radiographs would relieve the stress of busy doctors while also providing patients with a more accurate diagnosis. Acknowledgments Challenge Organizing Team Ha Q. Nguyen, PhD - Vingroup Big Data Institute Hieu H. Pham, PhD - Vingroup Big Data Institute Nhan T. Nguyen, MSc - Vingroup Big Data Institute Dung B. Nguyen, BSc - Vingroup Big Data Institute Minh Dao, PhD - Vingroup Big Data Institute Van Vu, PhD - Vingroup Big Data Institute Khanh Lam, MD, PhD - Hospital 108 Linh T. Le, MD, PhD - Hanoi Medical University Hospital Data Contributors The dataset used in this competition was created by assembling de-identified chest X-ray studies provided by two hospitals in Vietnam: the Hospital 108 and the Hanoi Medical University Hospital.`'",,VinBigData Chest X-ray Abnormalities Detection,,Automatically localize and classify thoracic abnormalities from chest radiographs,OpenImagesObjectDetectionAP,vinbigdata-chest-x-ray-abnormalities-detection 269,"'`Medium-voltage overhead power lines run for hundreds of miles to supply power to cities. These great distances make it expensive to manually inspect the lines for damage that doesn't immediately lead to a power outage, such as a tree branch hitting the line or a flaw in the insulator. These modes of damage lead to a phenomenon known as partial discharge: an electrical discharge which does not completely bridge the electrodes of an insulation system. Partial discharges slowly damage the power line, so left unrepaired they will eventually lead to a power outage or start a fire. Your challenge is to detect partial discharge patterns in signals acquired from these power lines with a new meter designed at the ENET Centre at VŠB. Effective classifiers using this data will make it possible to continuously monitor power lines for faults. 
The ENET Centre researches and develops renewable energy resources with the goal of reducing or eliminating harmful environmental impacts. Their efforts focus on developing technology solutions around transportation and processing of energy raw materials. By developing a solution to detect partial discharge, you'll help reduce maintenance costs and prevent power outages.`'",,VSB Power Line Fault Detection,,Can you detect faults in above-ground electrical lines?,MatthewsCorrelationCoefficient,vsb-power-line-fault-detection 270,"'`Walmart uses both art and science to continually make progress on their core mission of better understanding and serving their customers. One way Walmart is able to improve customers' shopping experiences is by segmenting their store visits into different trip types. Whether they're on a last-minute run for new puppy supplies or leisurely making their way through a weekly grocery list, classifying trip types enables Walmart to create the best shopping experience for every customer. Currently, Walmart's trip types are created from a combination of existing customer insights (""art"") and purchase history data (""science""). In their third recruiting competition, Walmart is challenging Kagglers to focus on the (data) science and classify customer trips using only a transactional dataset of the items they've purchased. Improving the science behind trip type classification will help Walmart refine their segmentation process. Walmart is hosting this competition to connect with data scientists who break the mold.`'",,Walmart Recruiting: Trip Type Classification,recruitment,Use market basket analysis to classify shopping trips,MulticlassLoss,walmart-recruiting:-trip-type-classification 271,"'`Prerequisites: To complete this project life cycle, you should be proficient in R or Python; default kernels are available in this competition. 
This business problem will give you hands-on experience implementing machine learning algorithms. Project Description: When an item is sold, what is the probability that the customer will file a warranty claim, and what are the important factors associated with that outcome? This is a problem organizations commonly face: understanding whether a claim is genuine or fraudulent based on different independent variables. The dataset has 19 explanatory variables describing whether the claim is fraudulent or not. Skills Needed Strong feature engineering skills Classification techniques such as Random Forest and boosting`'",,Warranty Claims,inClass,"When an item is sold, predict the probability that the customer will file a warranty claim",categorizationaccuracy,warranty-claims 272,'`s`',,water_test,inClass,test,rmse,water_test 273,"'`Web Enthusiasts' Club Recruitment Test 2018 This contest is being conducted as a part of the recruitment process for the Intelligence Group at Web Enthusiasts' Club NITK. Participants who can improve on the baseline will be shortlisted for interviews. Getting Started Here are a few resources: Numpy Documentation Pandas Documentation Scikit-Learn Documentation Tutorial on Random Forest Classifier`'",,Web Enthusiasts' Club NITK Recruitment,inClass,Official contest for Web Club NITK's Intelligence Group Recruitments for 2019-20 academic year,auc,web-enthusiasts-club-nitk-recruitment 274,"'`This is the home page of the competition. You don't need a subtitle here. The competition sub-title will appear above. This is where you introduce the problem. You can upload images using the ""select files"" widget on the left in the competition wizard. Upload an image, refresh the page, copy its URL, then insert within the wizard's editor. 
If you are copy-pasting from another application, like Word or your browser, try to make sure the html formatting is clean. You can view a page's html using the button at the top right of the editor's toolbar. This is a subtitle To format pages, stick to the following conventions: Paragraphs should go in p tags Code should go in pre tags Subtitles should go in h2 tags You can display equations using LaTeX enclosed in escaped brackets. For example, this: \[ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } \] is created by this: \[ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } \] Acknowledgements We thank Professor Plum, Ph.D. for providing this dataset.`'",,WEC ML Mentorship Contest,inClass,This contest is a 48 hour contest. The data set is a multi class classification problem.,map@{k},wec-ml-mentorship-contest 275,"'`The dataset is from the 1999 KDD Cup. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment. KDD website: https://www.kdd.org/`'",,NITK Web Enthusiast's Club 24 Hour Hackathon,,Revisiting KDD Cup 1999,logloss,nitk-web-enthusiasts-club-24-hour-hackathon 276,"'`Picture yourself strolling through your local, open-air market... What do you see? What do you smell? What will you make for dinner tonight? If you're in Northern California, you'll be walking past the inevitable bushels of leafy greens, spiked with dark purple kale and the bright pinks and yellows of chard. Across the world in South Korea, mounds of bright red kimchi greet you, while the smell of the sea draws your attention to squids squirming nearby. India's market is perhaps the most colorful, awash in the rich hues and aromas of dozens of spices: turmeric, star anise, poppy seeds, and garam masala as far as the eye can see. Some of our strongest geographic and cultural associations are tied to a region's local foods. 
This playground competition asks you to predict the category of a dish's cuisine given a list of its ingredients. Acknowledgements We want to thank Yummly for providing this unique dataset. Kaggle is hosting this playground competition for fun and practice.`'",,What's Cooking? (Kernels Only),,Use recipe ingredients to categorize the cuisine,CategorizationAccuracy,whats-cooking?-(kernels-only) 277,"'`In advance of the March 2, 2020 Global Women in Data Science (WiDS) Conference, we invite you to build a team, hone your data science skills, and join us in a predictive analytics challenge focused on social impact. Register at bit.ly/WiDSdatathon2020! The WiDS Datathon 2020 focuses on patient health through data from MIT's GOSSIS (Global Open Source Severity of Illness Score) initiative. Brought to you by the Global WiDS team, the West Big Data Innovation Hub, and the WiDS Datathon Committee, this year's datathon is open until February 24, 2020. Winners will be announced at the WiDS Conference at Stanford University and via livestream, reaching a community of 100,000+ data enthusiasts across more than 50 countries. Overview The challenge is to create a model that uses data from the first 24 hours of intensive care to predict patient survival. MIT's GOSSIS community initiative, with privacy certification from the Harvard Privacy Lab, has provided a dataset of more than 130,000 hospital Intensive Care Unit (ICU) visits from patients, spanning a one-year timeframe. This data is part of a growing global effort and consortium spanning Argentina, Australia, New Zealand, Sri Lanka, Brazil, and more than 200 hospitals in the United States. Labeled training data are provided for model development; you will then upload your predictions for unlabeled data to Kaggle, and these predictions will be used to determine the public leaderboard rankings, as well as the final winners of the competition. Data analysis can be completed using your preferred tools. 
Tutorials, sample code, and other resources will be posted throughout the competition at widsconference.org/datathon and on the Kaggle Discussion Forum. The winners will be determined by the leaderboard on the Kaggle platform at the time the contest closes on February 24. Who can participate We invite anyone, from those new to data science to veterans of the field, to participate. For those who have never tried machine learning or worked with health data before, we will release a series of tutorials and webinars to help you get started. The WiDS Datathon aims to inspire women worldwide to learn more about data science, and to create a supportive environment for women to connect with others in their community who share their interests. Toward these ends, we open the datathon to individuals or teams of up to 4; at least half of each team must be women (individuals identifying as female participants). Participants can include students, faculty, and individuals with various roles in non-profit, academic, government, and industry organizations. Acknowledgements The WiDS Datathon 2020 is a collaboration led by the Global WiDS team at Stanford, the West Big Data Innovation Hub, and the WiDS Datathon Committee. Special thanks to the MIT GOSSIS Initiative, the University of Toronto, and the Harvard Data Privacy Lab, as well as our growing community of sponsors and supporters.`'",,WiDS Datathon 2020,,Join the Women in Data Science (WiDS) Datathon 2020,auc,wids-datathon-2020 278,"'`It's another classic data set for predicting the quality of wine. The original data came as two separate sets, red and white; I have combined these two data sets into one. But you can also split the red and white back out and train on each yourself. Quality is rated on only 10 levels, from bad to good. 
You can choose either a classification or a regression model to predict the quality.`'",,Wine Quality Dataset,inClass,Predict the quality of wine,rmse,wine-quality-dataset 279,"'`In this tutorial competition, we dig a little ""deeper"" into sentiment analysis. Google's Word2Vec is a deep-learning inspired method that focuses on the meaning of words. Word2Vec attempts to understand meaning and semantic relationships among words. It works in a way that is similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. This tutorial focuses on Word2Vec for sentiment analysis. Sentiment analysis is a challenging subject in machine learning. People express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which could be very misleading for both humans and computers. There's another Kaggle competition for movie review sentiment analysis. In this tutorial we explore how Word2Vec can be applied to a similar problem. Deep learning has been in the news a lot over the past few years, even making it to the front page of the New York Times. These machine learning techniques, inspired by the architecture of the human brain and made possible by recent advances in computing power, have been making waves via breakthrough results in image recognition, speech processing, and natural language tasks. Recently, deep learning approaches won several Kaggle competitions, including a drug discovery task, and cat and dog image recognition. Tutorial Overview This tutorial will help you get started with Word2Vec for natural language processing. It has two goals: Basic Natural Language Processing: Part 1 of this tutorial is intended for beginners and covers basic natural language processing techniques, which are needed for later parts of the tutorial. 
Deep Learning for Text Understanding: In Parts 2 and 3, we delve into how to train a model using Word2Vec and how to use the resulting word vectors for sentiment analysis. Since deep learning is a rapidly evolving field, much of this work has not yet been published, or exists only as academic papers. Part 3 of the tutorial is more exploratory than prescriptive -- we experiment with several ways of using Word2Vec rather than giving you a recipe for using the output. To achieve these goals, we rely on an IMDB sentiment analysis data set, which has 100,000 multi-paragraph movie reviews, both positive and negative. Acknowledgements This dataset was collected in association with the following publication: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). ""Learning Word Vectors for Sentiment Analysis."" The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). (link) Please email the author of that paper if you use the data for any research applications. The tutorial was developed by Angela Chapman during her summer 2014 internship at Kaggle.`'",,Bag of Words Meets Bags of Popcorn,,Use Google's Word2Vec for movie reviews,AUC,bag-of-words-meets-bags-of-popcorn 280,"'`X5 , . , , , - . , , , . . .`'",,Modified Uplift Model for X5,,Build a model to predict who will react to SMS (validation stage),normalizedgini,modified-uplift-model-for-x5 281,"'`Does your favorite Ethiopian restaurant take reservations? Will a first date at that authentic looking bistro break your wallet? Is the diner down the street a good call for breakfast? Restaurant labels help Yelp users quickly answer questions like these, narrowing down their results to only restaurants that fit their nuanced needs. In this competition, Yelp is challenging Kagglers to build a model that automatically tags restaurants with multiple labels using a dataset of user-submitted photos. 
Currently, restaurant labels are manually selected by Yelp users when they submit a review. Selecting the labels is optional, leaving some restaurants un- or only partially-categorized. In an age of food selfies and photo-centric social storytelling, it may be no surprise to hear that Yelp's users upload an enormous amount of photos every day alongside their written reviews. Can you turn their pictures into (less than a thousand) words? Yelp isn't only looking for your best model; we're looking for data mining engineers who can help us use our data in novel ways while pushing code to production. The prize for this competition is a fast track through the recruiting process and an opportunity to show our data mining teams just what you've got! For more information about exciting opportunities at Yelp, check out the Jobs at Yelp competition page and Yelp's own careers page.`'",,Yelp Restaurant Photo Classification,,Predict attribute labels for restaurants using user-submitted photos,MeanFScore,yelp-restaurant-photo-classification 282,"'`Video captures a cross-section of our society. And major advances in analyzing and understanding video have the potential to touch all aspects of life, from learning and communication to entertainment and play. In this competition, Google is inviting the Kaggle community to join efforts to accelerate research in large-scale video understanding, while giving participants access to the Google Cloud Machine Learning Engine. Today, one of the greatest obstacles to rapid improvements in video understanding research has been the lack of large-scale, labeled datasets open to the public. For example, the availability of large, labeled datasets such as ImageNet has enabled continued breakthroughs in machine learning and machine perception. To that end, Google's recent release of the YouTube-8M (YT-8M) dataset represents a significant step in this direction. 
Making this resource open to everyone, from students to industry professionals, is expected to kick-start innovation in areas such as representation learning and video modeling architectures. In this competition, you are challenged to develop classification algorithms which accurately assign video-level labels using the new and improved YT-8M V2 dataset. The dataset was created from over 7 million YouTube videos (450,000 hours of video) and includes video labels from a vocabulary of 4716 classes (3.4 labels/video on average). It also comes with pre-extracted audio & visual features from every second of video (3.2B feature vectors in total). By taking part, Kagglers will not only play a pivotal role in setting state-of-the-art benchmarks, but also improve search and organization of video archives. Getting Started Review the data page for special instructions on how to access the competition's data. It will be hosted on Google Cloud. Participants have the option to download the data to work locally or work within the Google Cloud ML beta Platform. Review the tutorial on Getting Started with Google Cloud, and try the starter code. Sign up for a Google Cloud ML Platform free trial account. The free trial account includes $300 in credits! We've also provided a subsample of the data to explore on Kernels. Take a look at this Python notebook and create your own. Don't forget to review the prize eligibility details, which include requirements for code open-sourcing and a paper submission. Because Cloud ML is currently a beta product, Google welcomes the opportunity to hear your feedback about using the tool. Please share your questions and thoughts on the competition's forums. Additional resources specific to the YT-8M dataset and Google Cloud ML can be found here. 
Acknowledgements Google Cloud Machine Learning, Competition Sponsor Google Cloud Machine Learning is a managed service that enables you to easily build machine learning models that work on any type of data, of any size. Create your model with the powerful TensorFlow framework that powers many Google products, from Google Photos to Google Cloud Speech. Build models of any size with our managed scalable infrastructure. Your trained model is immediately available for use with our global prediction platform that can support thousands of users and TBs of data. The service is integrated with Google Cloud Dataflow for pre-processing, allowing you to access data from Google Cloud Storage, Google BigQuery, and others.`'",,Google Cloud & YouTube-8M Video Understanding Challenge,,Can you produce the best video tag predictions?,GoogleGlobalAP,google-cloud-&-youtube-8m-video-understanding-challenge 283,"'`The world is generating and consuming an enormous amount of video content. Currently on YouTube, people watch over 1 billion hours of video every single day. To spur advances in analyzing and understanding video, Google AI has publicly released a large-scale video dataset that consists of millions of YouTube video features and associated labels from a diverse vocabulary of 3,700+ visual entities, called the YouTube-8M Dataset. Last year, we successfully hosted the Google Cloud & YouTube-8M Video Understanding Challenge, with 742 participating teams comprising 946 individual competitors from 60 countries. This competition is the second Kaggle competition based on the YouTube 8M dataset, and is focused on learning video representation under budget constraints. For a lot of video tasks where there are a large number of classes, like recommending new videos or automatic video classification, compact models need to meet memory and computational requirements. This is true even if working in cloud computational environments. 
Also, compact models make it possible to have limited-memory or catalog indexes on devices in order to do personalized and privacy-preserving computation on users' personal mobile phones. In this competition, you're challenged to produce a compact video classification model. Your model size must not exceed 1 GB (this is strictly enforced, through model upload). We encourage participants to train a model that most efficiently uses this budget, rather than ensembles of lots of models. This competition is being hosted by Google AI (previously known as Google Research) as a part of the European Conference on Computer Vision (ECCV) 2018 selected workshop session. Please refer to the YouTube 8M Large-Scale Video Understanding Workshop Page for details about the workshop.`'",,The 2nd YouTube-8M Video Understanding Challenge,,Can you create a constrained-size model to predict video labels?,GoogleGlobalAP,the-2nd-youtube-8m-video-understanding-challenge 284,"'`Imagine being able to search for the moment in any video where an adorable kitten sneezes, even though the uploader didn't title or describe the video with such descriptive metadata. Now, apply that same concept to videos that cover important or special events like a baby's first steps or a game-winning goal -- and now we have the ability to quickly find and share special video moments. This technology is called temporal concept localization within video, and Google Research can use your help to advance the state of the art in this area. An example of the detected action ""blowing out candles"" In most web searches, video retrieval and ranking is performed by matching query terms to metadata and other video-level signals. However, we know that videos can contain an array of topics that aren't always characterized by the uploader, and many of these approaches miss localizing brief but important moments within the video. 
Temporal localization can enable applications such as improved video search (including search within video), video summarization and highlight extraction, action moment detection, improved video content safety, and many others. In previous years, participants worked on advancements in video-level annotations, building both unconstrained and constrained models. In this third challenge based on the YouTube 8M dataset, Kagglers will localize video-level labels to the precise time in the video where the label actually appears, and do this at an unprecedented scale. To put it another way: at what point in the video does the cat sneeze? If successful, your new machine learning models will significantly improve video understanding for all, by not only identifying the topics relevant to a video, but also pinpointing where in the video they appear. This competition is being hosted by Google Research as a part of the International Conference on Computer Vision (ICCV) 2019 selected workshop session. Please refer to the YouTube 8M Large-Scale Video Understanding Workshop Page for details about the workshop.`'",,The 3rd YouTube-8M Video Understanding Challenge,,Temporal localization of topics within video,YT8M_MeanAveragePrecisionAtK,the-3rd-youtube-8m-video-understanding-challenge 285,"'`Example baseline submissions are available as part of the pylearn2 python package, available at https://github.com/lisa-lab/pylearn2. The baseline submissions for this contest are in pylearn2/scripts/icml_2013_wrepl/emotions. Because this task is very easy for humans to do, we will not provide the final test inputs until one week before the contest closes. Preliminary winners will need to release their winning code and demonstrate that they did not manually label the test set. 
We reserve the right to disqualify entries that may involve any manual labeling of the test set.`'",,Challenges in Representation Learning: Facial Expression Recognition Challenge,research,Learn facial expressions from an image,CategorizationAccuracy,challenges-in-representation-learning:-facial-expression-recognition-challenge 286,"'`Open up your pantry and you're likely to find several wheat products. Indeed, your morning toast or cereal may rely upon this common grain. Its popularity as a food and crop makes wheat widely studied. To get large and accurate data about wheat fields worldwide, plant scientists use image detection of ""wheat heads"": spikes atop the plant containing grain. These images are used to estimate the density and size of wheat heads in different varieties. Farmers can use the data to assess health and maturity when making management decisions in their fields. However, accurate wheat head detection in outdoor field images can be visually challenging. There is often overlap of dense wheat plants, and the wind can blur the photographs. Both make it difficult to identify single heads. Additionally, appearances vary due to maturity, color, genotype, and head orientation. Finally, because wheat is grown worldwide, different varieties, planting densities, patterns, and field conditions must be considered. Models developed for wheat phenotyping need to generalize between different growing environments. Current detection methods involve one- and two-stage detectors (Yolo-V3 and Faster-RCNN), but even when trained with a large dataset, a bias to the training region remains. The Global Wheat Head Dataset is led by nine research institutes from seven countries: the University of Tokyo, Institut national de recherche pour l'agriculture, l'alimentation et l'environnement, Arvalis, ETHZ, University of Saskatchewan, University of Queensland, Nanjing Agricultural University, and Rothamsted Research. 
These institutions are joined by many in their pursuit of accurate wheat head detection, including the Global Institute for Food Security, DigitAg, Kubota, and Hiphen. In this competition, you'll detect wheat heads from outdoor images of wheat plants, including wheat datasets from around the globe. Using worldwide data, you will focus on a generalized solution to estimate the number and size of wheat heads. To better gauge the performance for unseen genotypes, environments, and observational conditions, the training dataset covers multiple regions. You will use more than 3,000 images from Europe (France, UK, Switzerland) and North America (Canada). The test data includes about 1,000 images from Australia, Japan, and China. Wheat is a staple across the globe, which is why this competition must account for different growing conditions. Models developed for wheat phenotyping need to be able to generalize between environments. If successful, researchers can accurately estimate the density and size of wheat heads in different varieties. With improved detection, farmers can better assess their crops, ultimately bringing cereal, toast, and other favorite dishes to your table. This is a Code Competition. Refer to Code Requirements for details.`'",,Global Wheat Detection,,Can you help identify wheat heads using image analysis?,custom metric,global-wheat-detection 287,"'`Introduction Google AI (Google's AI research arm, tasked with advancing AI for everyone) is challenging you to build an algorithm that detects objects automatically using an absolutely massive training dataset: one with more varied and complex bounding-box annotations and object classes than ever before. Here's the background. Computers are getting better and better at vision. But in a few critical ways, they still can't match a human's intuitive perception. For example, what do you see when you look at this photo? 
Most of us would answer, a sandy beach, the ocean, a few people walking, some trees, grass, and buildings, and a woman walking her dog right there! Oh yeah, and there is a man holding a plastic cup. Can a computer provide as precise an image description? Google AI wants to further push the capabilities of computer vision. We hope that providing a very large training set will stimulate research into more sophisticated object and relationship detection models that will exceed current state-of-the-art performance. The results of this Challenge will be presented at a workshop at the European Conference on Computer Vision 2018. Object Detection Track Object detection is a central task in computer vision, with applications ranging across search, robotics, self-driving cars, and many others. As deep network solutions become deeper and more complex, they are often limited by the amount of training data available. With this in mind, to spur advances in analyzing and understanding images, Google AI has publicly released the Open Images dataset. Open Images follows the tradition of PASCAL VOC, ImageNet and COCO, now at an unprecedented scale. The Open Images Challenge is based on the Open Images dataset. The training set of the Challenge contains: 12M bounding-box annotations for 500 object classes on 1.7M training images Images of complex scenes with several objects, an average of 7 boxes per image Highly varied images that contain brand new objects like fedora and snowman Class hierarchy that reflects the relationships between classes of Open Images. In this track of the Challenge, you are asked to build the best-performing algorithm for automatically detecting objects. Please refer to the Open Images Challenge page for additional details on the dataset. In addition to this Object Detection track, the Challenge also includes a Visual Relationship Detection track to detect pairs of objects in particular relations, e.g. 
""woman playing guitar,"" ""beer on table,"" ""dog inside car,"" ""man holding coffee,"" etc. The Visual Relationship Detection track is available here. Example annotations. Left: Mark Paul Gosselaar plays the guitar by Rhys A. Right: the house by anita kluska. Both images used under CC BY 2.0 license.`'",,Google AI Open Images - Object Detection Track,,Detect objects in varied and complex images.,OpenImagesObjectDetectionAP,google-ai-open-images-object-detection-track 288,"'`Introduction Google AI (Google's AI research arm, tasked with advancing AI for everyone) is challenging you to build an algorithm that detects objects automatically using an absolutely massive training dataset: one with more varied and complex bounding-box annotations and object classes than ever before. Here's the background. Computers are getting better and better at vision. But in a few critical ways, they still can't match a human's intuitive perception. For example, what do you see when you look at this photo? Most of us would answer, a sandy beach, the ocean, a few people walking, some trees, grass, and buildings, and a woman walking her dog right there! Oh yeah, and there is a man holding a plastic cup. Can a computer provide as precise an image description? Google AI wants to further push the capabilities of computer vision. We hope that providing a very large training set will stimulate research into more sophisticated object and relationship detection models that will exceed current state-of-the-art performance. The results of this Challenge will be presented at a workshop at the European Conference on Computer Vision 2018. Visual Relationship Detection Track Identifying different objects (man and cup) is an important problem on its own, but identifying the relationship between them (holding) is critical for many real world use cases. 
In this Visual Relationship Detection track Challenge, you're asked to build an algorithm that detects pairs of objects in particular relations: things like ""woman playing guitar,"" ""beer on table,"" or ""dog inside car."" The Challenge dataset includes both object bounding boxes and visual relationship annotations. The training set contains annotations for 329 distinct relationship triplets, occurring a total of 374,768 times. In this track of the Challenge, you are asked to build the best-performing algorithm for automatically detecting relationship triplets. Please refer to the Open Images Challenge page for additional details on the dataset. This competition is one of two tracks in the Open Images Challenge. Find the Object Detection track of this competition using the entire training set here. Example of man playing guitar Radiofiera - Villa Cordellina Lombardi, Montecchio Maggiore (VI) - agosto 2010 by Andrea Sartorati Example of chair at table Epic Fireworks - Loads A Room by Epic Fireworks`'",,Google AI Open Images - Visual Relationship Track,,Detect pairs of objects in particular relationships.,OpenImagesVisualRelations,google-ai-open-images-visual-relationship-track 289,"'`Manchester City F.C. and Google Research are proud to present an AI football competition using the Google Research Football Environment. A word from Manchester City F.C. Brian Prestidge, Director of Data Insights & Decision Technology at City Football Group, the owners of Manchester City F.C., sets out the challenge. Football is a tough environment to perform in and an even tougher environment to learn in. Learning is all about harnessing failure, but failure in football is seldom accepted. Working with Google Research's physics-based football environment provides us with a new place to learn through simulation and offers us the capabilities to test tactical concepts and refine principles so that they are strong enough for a coach to stake their career on. 
We are therefore very pleased to be working with Google's research team in creating this competition and are looking forward to the opportunity to support some of the most creative and successful competitors through funding and exclusive prizes. We hope to establish ongoing collaboration with the winners beyond this competition, and that it will provide us all with the platform to explore and establish fundamental principles of football tactics, thus improving our ability to perform and be successful on the pitch. Greg Swimer, Chief Technology Officer at City Football Group, added ""Technologies such as Machine Learning and Artificial Intelligence have huge future potential to enhance the understanding and enjoyment of football for players, coaches and fans. We are delighted to be collaborating with Google's research team to help broaden the knowledge, talent, and innovation working in this exciting and transformational area"". The Google Research football environment competition The world gets a kick out of football (soccer in the United States). As the most popular sport on the planet, millions of fans enjoy watching Sergio Agüero, Raheem Sterling, and Kevin de Bruyne on the field. Football video games are less lively, but still immensely popular, and we wonder if AI agents would be able to play those properly. Researchers want to explore AI agents' ability to play in complex settings like football. The sport requires a balance of short-term control, learned concepts such as passing, and high-level strategy, which can be difficult to teach agents. A current environment exists to train and test agents, but other solutions may offer better results. The teams at Google Research aspire to make discoveries that impact everyone. Essential to their approach is sharing research and tools to fuel progress in the field. Together with Manchester City F.C., Google Research has put forth this competition to get help in reaching their goal. 
In this competition, you'll create AI agents that can play football. Teams compete in steps, where agents react to a game state. Each agent in an 11 vs 11 game controls a single active player and takes actions to improve their team's situation. As with a typical football game, you want your team to score more than the other side. You can optionally see your efforts rendered in a physics-based 3D football simulation. If controlling 11 football players with code sounds difficult, don't be discouraged! You only need to control one player at a time (the one with the ball on offense, or the one closest to the ball on defense) and your code gets to pick from 1 of 19 possible actions. We have prepared a getting started example to show you how simple a basic strategy can be. Before implementing your own strategy, however, you might want to learn more about the Google Research football environment, especially the observations provided to you by the environment and the available actions. You can also play the game yourself on your computer locally to get a better understanding of the environment's dynamics and explore different scenarios. If successful, you'll help researchers explore the ability of AI agents to play in complex settings. This could offer new insights into the strategies of the world's most-watched sport. Additionally, this research could pave the way for a new generation of AI agents that can be trained to learn complex skills.`'",,Google Research Football with Manchester City F.C.,featured,Train agents to master the world's most popular sport,football,google-research-football-with-manchester-city-f.c. 290,"'`Computers are really good at answering questions with single, verifiable answers. But humans are often still better at answering questions about opinions, recommendations, or personal experiences.
Humans are better at addressing subjective questions that require a deeper, multidimensional understanding of context - something computers aren't trained to do well yet. Questions can take many forms - some have multi-sentence elaborations, others may be simple curiosity or a fully developed problem. They can have multiple intents, or seek advice and opinions. Some may be helpful and others interesting. Some are simply right or wrong. Unfortunately, it's hard to build better subjective question-answering algorithms because of a lack of data and predictive models. That's why the CrowdSource team at Google Research, a group dedicated to advancing NLP and other types of ML science via crowdsourcing, has collected data on a number of these quality scoring aspects. In this competition, you're challenged to use this new dataset to build predictive algorithms for different subjective aspects of question-answering. The question-answer pairs were gathered from nearly 70 different websites, in a ""common-sense"" fashion. Our raters received minimal guidance and training, and relied largely on their subjective interpretation of the prompts. As such, each prompt was crafted in the most intuitive fashion so that raters could simply use their common sense to complete the task. By lessening our dependency on complicated and opaque rating guidelines, we hope to increase the re-use value of this data set. What you see is what you get! Demonstrating that these subjective labels can be predicted reliably can shine a new light on this research area.
Results from this competition will inform the way future intelligent Q&A systems are built, hopefully contributing to them becoming more human-like.`'",,Google QUEST Q&A Labeling,,Improving automated understanding of complex question answer content,MCSpearmanR,google-quest-q&a-labeling 291,"'`Think back to this morning: turning off the alarm, getting dressed, brushing your teeth, making coffee, drinking coffee, and locking the door as you left for work. Now imagine doing all those things again, without the use of your hands. Patients who have lost hand function due to amputation or neurological disabilities wake up to this reality every day. Restoring a patient's ability to perform these basic activities of daily life with a brain-computer interface (BCI) prosthetic device would greatly increase their independence and quality of life. Currently, there are no realistic, affordable, or low-risk options for neurologically disabled patients to directly control external prosthetics with their brain activity. Recorded from the human scalp, EEG signals are evoked by brain activity. The relationship between brain activity and EEG signals is complex and poorly understood outside of specific laboratory tests. Providing affordable, low-risk, non-invasive BCI devices is dependent on further advancements in interpreting EEG signals. This competition challenges you to identify when a hand is grasping, lifting, and replacing an object using EEG data that was taken from healthy subjects as they performed these activities. Better understanding the relationship between EEG signals and hand movements is critical to developing a BCI device that would give patients with neurological disabilities the ability to move through the world with greater autonomy.
Acknowledgements This competition is sponsored by the WAY Consortium (Wearable interfaces for hAnd function recoverY; FP7-ICT-288551).`'",,Grasp-and-Lift EEG Detection,,Identify hand motions from EEG recordings,MCAUC,grasp-and-lift-eeg-detection 292,"'`If you don't know the Guess The Correlation Game, go there and play it for a while. If you don't know what (linear) correlation is, see the Wikipedia article to get some idea of what it is. The goal is to build a function that, given an image of a scatterplot, returns its linear correlation coefficient. Have fun!`'",,Guess the correlation,inClass,Help robots to beat humans at Guess Correlation Game,rmse,guess-the-correlation 293,"'`Note: This simulation is a playground competition extending the fourth season of Halite for participation. We have modified the rules to serve as a two-player game instead of a four-player game. No points or medals will be awarded for this competition. Ahoy there! There's halite to be had and ships to be deployed! Are you ready to navigate the skies and secure your territory? Halite by Two Sigma (""Halite"") is a resource management game where you build and control a small armada of ships. Your algorithms determine their movements to collect halite, a luminous energy source. The most halite at the end of the match wins, but it's up to you to figure out how to make effective and efficient moves. You control your fleet, build new ships, create shipyards, and mine the regenerating halite on the game board. Since Halite was created by Two Sigma in 2016, more than 15,000 people around the world have participated in a Halite challenge. Players apply advanced algorithms in a dynamic, open source game setting. The strategic depth and immersive, interactive nature of Halite games make each challenge a unique learning environment. Halite IV builds on the core game design of Halite III with a number of key changes that shift the focus of the game towards tighter competition on a smaller board.
New game features include regenerating halite, shipyard creation, no more ship movement costs, and stealing halite from other players! So dust off your halite meters and fasten your seatbelts. The fourth season of Halite is about to begin!`'",,Halite by Two Sigma - Playground Edition,,Collect the most halite during your match in space,halite,halite-by-two-sigma-playground-edition 294,"'`Welcome to the Data Science Guild's first Datathon Competition Description MNIST (""Modified National Institute of Standards and Technology"") is the ""hello world"" dataset of computer vision. It was released in 1999, and since then it has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike. In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. This dataset is not the real MNIST data but is quite similar. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare. You only have 20 submissions each day, so use them well! Beware, the data is noisy!`'",,Handwritten digit recognition (PITC),,The Data Science Guild presents the first hackathon - A twist to MNIST,categorizationaccuracy,handwritten-digit-recognition-(pitc) 295,"'`This is a synthetic code challenge to sharpen your programming skills. This problem was first released during the 2016 qualification round of Google's annual coding competition, Hash Code. We've re-released it as a Playground Code Competition to help you sharpen your skills. Along with the Photo Slideshow Optimization competition, open for late submissions, you can use it as practice in advance of Hash Code 2021.
The Internet has profoundly changed the way we buy things, but the online shopping of today is likely not the end of that change; expectations for purchase delivery have gone from a week, to two days, to one day, to same day. What about in just a few hours? With drones, this may be possible, and they'll bring a whole new fleet of problems to solve with data science. Drones are autonomous, electric vehicles often used to deliver online purchases. Current experiments use flying drones, so they're never stuck in traffic. As drone technology improves every year, there remains a major issue: how would we manage and coordinate all those drones? In this competition, you are given a hypothetical fleet of drones, a list of customer orders, and the availability of the individual products in warehouses. Can you schedule the drone operations so that the orders are completed as soon as possible? When flying delivery drones become the norm, scheduling is one of the many problems to be solved. Get a head start and improve your data science skills at the same time. This is a Code Competition. Refer to Code Requirements for details. Photo by Ian Usher on Unsplash`'",,Hash Code Archive - Drone Delivery,,Can you help coordinate the drone delivery supply chain?,PostProcessorKernelDesc,hash-code-archive-drone-delivery 296,"'`Note: Put your heads together to solve programming challenges. Google's coding competition, Hash Code, has just finished for 2020. Use this online qualifier from 2019 to keep your skills sharp for future competitions! As the saying goes, ""a picture is worth a thousand words."" We agree that photos are an important part of contemporary digital and cultural life. How we experience photos largely depends on the story they're arranged to tell. The same shots could be a monotonous series of snaps or form a narrative masterpiece. Approximately 2.5 billion people around the world carry a camera in the form of a smartphone in their pocket every day.
We tend to make good use of it, too, taking more photos than ever (back in 2017, Google Photos announced it was backing up more than 1.2 billion photos and videos per day)! The rise of digital photography creates an interesting challenge: what should we do with all of these photos? In this competition, you will compose a slideshow out of a photo collection. Given a list of photos and the tags associated with each photo, you are challenged to arrange the photos into a slideshow that is as interesting as possible (the evaluation section explains what we mean by interesting). Will your slideshow tell a good story or be a major snoozefest?`'",,Hash Code Archive - Photo Slideshow Optimization,,Optimizing a photo album from Hash Code 2019,PostProcessorKernelDesc,hash-code-archive-photo-slideshow-optimization 297,"'`Explaining the meaning of variables in the dataset 1. Car_ID: Unique id of each observation (Integer) 2. Symboling: Its assigned insurance risk rating; a value of +3 indicates that the auto is risky, -3 that it is probably pretty safe. (Categorical) 3. carCompany: Name of car company (Categorical) 4. fueltype: Car fuel type, i.e. gas or diesel (Categorical) 5. aspiration: Aspiration used in a car (Categorical) 6. doornumber: Number of doors in a car (Categorical) 7. carbody: Body of car (Categorical) 8. drivewheel: Type of drive wheel (Categorical) 9. enginelocation: Location of car engine (Categorical) 10. wheelbase: Wheelbase of car (Numeric) 11. carlength: Length of car (Numeric) 12. carwidth: Width of car (Numeric) 13. carheight: Height of car (Numeric) 14. curbweight: The weight of a car without occupants or baggage (Numeric) 15. enginetype: Type of engine.
(Categorical) 16. cylindernumber: Number of cylinders in the car (Categorical) 17. enginesize: Size of engine (Numeric) 18. fuelsystem: Fuel system of car (Categorical) 19. boreratio: Bore ratio of car (Numeric) 20. stroke: Stroke or volume inside the engine (Numeric) 21. compressionratio: Compression ratio of car (Numeric) 22. horsepower: Horsepower (Numeric) 23. peakrpm: Car peak rpm (Numeric) 24. citympg: Mileage in city (Numeric) 25. highwaympg: Mileage on highway (Numeric) 26. price (Dependent variable): Price of car (Numeric)`'",,Hello Kaggle F464,,A test evaluative lab to get you accustomed to the process.,rmse,hello-kaggle-f464 298,"'`The Herbarium 2020 FGVC7 Challenge is to identify vascular plant species from a large, long-tailed collection of herbarium specimens provided by the New York Botanical Garden (NYBG). The Herbarium 2020 dataset contains over 1M images representing over 32,000 plant species. This is a dataset with a long tail; there are a minimum of 3 specimens per species. However, some species are represented by more than a hundred specimens. This dataset only contains vascular land plants, which include lycophytes, ferns, gymnosperms, and flowering plants. The extinct forms of lycophytes are the major component of coal deposits, ferns are indicators of ecosystem health, gymnosperms provide major habitats for animals, and flowering plants provide all of our crops, vegetables, and fruits. The teams with the most accurate models will be contacted, with the intention of using their models on the un-named plant collections in the NYBG herbarium collection, with the results assessed by the NYBG plant specialists. Background The New York Botanical Garden (NYBG) herbarium contains more than 7.8 million plant and fungal specimens. Herbaria are a massive repository of plant diversity data. These collections not only represent a vast amount of plant diversity, but since herbarium collections include specimens dating back hundreds of years, they provide snapshots of plant diversity through time.
The integrity of the plant is maintained in herbaria as a pressed, dried specimen; a specimen collected nearly two hundred years ago by Darwin looks much the same as one collected a month ago by an NYBG botanist. All specimens not only maintain their morphological features but also include collection dates and locations, and the name of the person who collected the specimen. This information, multiplied by millions of plant collections, provides the framework for understanding plant diversity on a massive scale and learning how it has changed over time. About This is an FGVC competition hosted as part of the FGVC7 workshop at CVPR 2020 and sponsored by NYBG. Details of this competition are mirrored on the GitHub page. Please post in the forum or open an issue if you have any questions or problems with the dataset.`'",,Herbarium 2020 - FGVC7,,Identify plant species from herbarium specimens. Data from New York Botanical Garden.,MacroFScore,herbarium-2020-fgvc7 299,"'`This contest focuses on using the nucleotide sequence of the Reverse Transcriptase (RT) and Protease (PR) to predict the patient's short-term progression. For the non-Biologist: the nucleotide sequence is the blueprint of the protein, which is the workhorse of the cell. The RT enzyme is responsible for copying the HIV-1 genome within the cell. When the HIV-1 genome is translated, it yields one long string of amino acids; the PR protein cuts this string into the numerous functional units required by the HIV life-cycle. These are the proteins that are targeted by most HIV-1 drugs since they are mostly unique to the HIV-1 life-cycle. Along with the HIV-1 viral sequences, I have provided the two common clinical indicators used to determine the ""general health"" of an HIV-1 infected individual: Viral Load and CD4+ cell counts. The CD4+ cell count is an estimate of the number of white blood cells in 1 mL of blood, while the viral load is the number of viral particles in that same mL.
In this dataset the viral load is represented on a log-10 scale. The higher the number, the more ""active"" the immune system. Paradoxically, higher CD4 counts imply both a healthier individual and a higher amount of viral reproduction (the virus primarily replicates in CD4 cells). If you're interested in learning more about the HIV lifecycle and HIV treatments, here are some extra resources: http://en.wikipedia.org/wiki/HIV http://www.youtube.com/watch?v=RO8MP3wMvqg&feature=related http://www.hiv.lanl.gov/content/sequence/HIV/HIVTools.html http://en.wikipedia.org/wiki/HIV_therapy#Treatment`'",,Predict HIV Progression,featured,"This contest requires competitors to predict the likelihood that an HIV patient's infection will become less severe, given a small dataset and limited clinical information. ",MCE,predict-hiv-progression 300,"'`Task description We are going to use the Boston Housing dataset. You are supposed to teach a model to predict the MEDV (median value) from the other features. The train.csv file contains the MEDV value, but the test.csv doesn't. Export your predictions for MEDV into a csv file, following the format in the provided sampleSubmission.csv file. The predictions will be scored against a held-out dataset. How to do it You can use whatever platform you want for training models - your own laptop, Google Colaboratory, or (perhaps the easiest option) a notebook or script hosted here on Kaggle. The advantage of the latter approach is that all relevant libraries, like xgboost and lightgbm, are already pre-installed on Kaggle. You can even use a GPU! (Which would be total overkill for this small dataset, but anyway.) You might want to start from the ""sample_code"" notebook. Cheating It would be very easy to cheat on this task, given that it uses a public dataset.
Please don't do that, but rather try to generate the best model you can, for the sake of your own learning.`'",,Boston Housing Data,inClass,In-house competition for tree ensemble workshop,rmse,boston-housing-data 301,"'`Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities. While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.`'",,Home Credit Default Risk,,Can you predict how capable each applicant is of repaying a loan?,AUC,home-credit-default-risk 302,"'`Shoppers rely on Home Depot's product authority to find and buy the latest products and to get timely solutions to their home improvement needs. From installing a new ceiling fan to remodeling an entire kitchen, with the click of a mouse or tap of the screen, customers expect the correct results to their queries quickly. Speed, accuracy, and delivering a frictionless customer experience are essential. In this competition, Home Depot is asking Kagglers to help them improve their customers' shopping experience by developing a model that can accurately predict the relevance of search results.
Search relevancy is an implicit measure Home Depot uses to gauge how quickly they can get customers to the right products. Currently, human raters evaluate the impact of potential changes to their search algorithms, which is a slow and subjective process. By removing or minimizing human input in search relevance evaluation, Home Depot hopes to increase the number of iterations their team can perform on the current search algorithms.`'",,Home Depot Product Search Relevance,,Predict the relevance of search results on homedepot.com,RMSE,home-depot-product-search-relevance 303,"'`Welcome to the class project for our 2020 Hands On Machine Learning class! This is one of two private Kaggle competitions, of which the other can be found here. In the Data tab, you can find the file to download for this project, including a notebook to help you get started. After going through the notebook, you'll create a submissions.csv file, which you'll be able to submit to this Kaggle competition for scoring and a place on the leaderboard. You can make up to 20 submissions a day. For this competition, we will be working with the IMDB review dataset. This is a dataset of highly polar movie reviews, and our goal is to perform sentiment classification to figure out how the reviewer was feeling. The data is structured as a csv, with each row containing a review. Here is a description of the data: Review sentiment (review text) positive/negative One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me. positive This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that.
negative`'",,HOML Class Project: IMDB Challenge,,Sentiment analysis on IMDB reviews!,f_{beta},homl-class-project:-imdb-challenge 304,'`A competition to predict house prices`',,House price predict,inClass,Predict house price for teamer practising sgd,rmse,house-price-predict 305,"'`Start here if... You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition. Competition Description Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. Practice Skills Creative feature engineering Advanced regression techniques like random forest and gradient boosting Acknowledgments The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. Photo by Tom Thain on Unsplash.`'",,House Prices - Advanced Regression Techniques,,"Predict sales prices and practice feature engineering, RFs, and gradient boosting",rmsle,house-prices-advanced-regression-techniques 306,"'`For agriculture, it is extremely important to know how much it rained on a particular field. However, rainfall is variable in space and time and it is impossible to have rain gauges everywhere. Therefore, remote sensing instruments such as radar are used to provide wide spatial coverage.
Rainfall estimates drawn from remotely sensed observations will never exactly match the measurements that are carried out using rain gauges, due to the inherent characteristics of both sensors. Currently, radar observations are ""corrected"" using nearby gauges and a single estimate of rainfall is provided to users who need to know how much it rained. This competition will explore how to address this problem in a probabilistic manner. Knowing the full probabilistic spread of rainfall amounts can be very useful to drive hydrological and agronomic models -- much more than a single estimate of rainfall. Image courtesy of NOAA Unlike a conventional Doppler radar, a polarimetric radar transmits radio wave pulses that have both horizontal and vertical orientations. Because rain drops become flatter as they increase in size and because ice crystals tend to be elongated vertically, whereas liquid droplets tend to be flattened, it is possible to infer the size of rain drops and the type of hydrometeor from the differential reflectivity of the two orientations. In this competition, you are given polarimetric radar values and derived quantities at a location over the period of one hour. You will need to produce a probabilistic distribution of the hourly rain gauge total. More details are on the data page. This competition is sponsored by the Artificial Intelligence Committee of the American Meteorological Society. The Climate Corporation has kindly agreed to sponsor the prizes.`'",,How Much Did It Rain?,research,Predict probabilistic distribution of hourly rain given polarimetric radar measurements,CRPS,how-much-did-it-rain? 307,"'`After incorporating feedback from the Kaggle community, as well as scientific and educational partners, the Artificial Intelligence Committee of the American Meteorological Society is excited to be running a second iteration of the How Much Did It Rain? competition. How Much Did It Rain? 
II is focused on solving the same core rain measurement prediction problem, but approaches it with a new and improved dataset and evaluation metric. This competition will go even further towards building a useful educational tool for universities, as well as making a meaningful contribution to continued meteorological research. Competition Description Rainfall is highly variable across space and time, making it notoriously tricky to measure. Rain gauges can be an effective measurement tool for a specific location, but it is impossible to have them everywhere. In order to have widespread coverage, data from weather radars is used to estimate rainfall nationwide. Unfortunately, these predictions never exactly match the measurements taken using rain gauges. Recently, in an effort to improve their rainfall predictors, the U.S. National Weather Service upgraded their radar network to be polarimetric. These polarimetric radars are able to provide higher quality data than conventional Doppler radars because they transmit radio wave pulses with both horizontal and vertical orientations. Dual pulses make it easier to infer the size and type of precipitation because rain drops become flatter as they increase in size, whereas ice crystals tend to be elongated vertically. In this competition, you are given snapshots of polarimetric radar values and asked to predict the hourly rain gauge total. A word of caution: many of the gauge values in the training dataset are implausible (gauges may get clogged, for example). More details are on the data page. Acknowledgements This competition is sponsored by the Artificial Intelligence Committee of the American Meteorological Society. Climate Corporation is providing the prize pool.`'",,How Much Did It Rain? II,,Predict hourly rainfall using data from polarimetric radars,MAE,how-much-did-it-rain?-ii 308,"'`There are billions of humans on this earth, and each of us is made up of trillions of cells. 
Just as every individual is unique, even among genetically identical twins, scientists observe differences between the genetically identical cells in our bodies. Differences in the location of proteins can give rise to such cellular heterogeneity. Proteins play essential roles in virtually all cellular processes. Often, many different proteins come together at a specific location to perform a task, and the exact outcome of this task depends on which proteins are present. As you can imagine, different subcellular distributions of one protein can give rise to great functional heterogeneity between cells. Finding such differences, and figuring out how and why they occur, is important for understanding how cells function, how diseases develop, and ultimately how to develop better treatments for those diseases. To see more, start with less. That may seem counterintuitive, but the study of a single cell enables the discovery of mechanisms too difficult to see with multi-cell research. The importance of studying single cells is reflected in the ongoing revolution in biology centered around technologies for single cell analysis. Microscopy offers an opportunity to study differences in protein localizations within a population of cells. Current machine learning models for classifying protein localization patterns in microscope images give a summary of the entire population of cells. However, the single-cell revolution in biology demands models that can precisely classify patterns in each individual cell in the image. The Human Protein Atlas is an initiative based in Sweden that is aimed at mapping proteins in all human cells, tissues, and organs. The data in the Human Protein Atlas database is freely accessible to scientists all around the world, allowing them to explore the cellular makeup of the human body.
Solving the single-cell image classification challenge will help us characterize single-cell heterogeneity in our large collection of images by generating more accurate annotations of the subcellular localizations for thousands of human proteins in individual cells. Thanks to you, we will be able to more accurately model the spatial organization of the human cell and provide new open-access cellular data to the scientific community, which may accelerate our growing understanding of how human cells function and how diseases develop. This is a weakly supervised multi-label classification problem and a code competition. Given images of cells from our microscopes and labels of protein location assigned together for all cells in the image, Kagglers will develop models capable of segmenting and classifying each individual cell with precise labels. If successful, you'll contribute to the revolution of single-cell biology! The scientific journal Nature Methods is interested in considering a paper discussing the outcome and approaches of the challenge. The Human Protein Atlas team, led by Professor Emma Lundberg, would like to invite top performing teams to join as co-authors in writing this paper. Please follow the discussion forum for more details on how you can help. This is a Code Competition. Refer to Code Requirements for details.`'",,Human Protein Atlas - Single Cell Classification,,Find individual human cell differences in microscope images,OpenImagesObjDetectionSegmentationAP,human-protein-atlas-single-cell-classification 309,"'`Our best estimates show there are over 7 billion people on the planet and 300 billion stars in the Milky Way galaxy. By comparison, the adult human body contains 37 trillion cells. To determine the function and relationship among these cells is a monumental undertaking. Many areas of human health would be impacted if we better understand cellular activity. A problem with this much data is a great match for the Kaggle community.
Just as the Human Genome Project mapped the entirety of human DNA, the Human BioMolecular Atlas Program (HuBMAP) is a major endeavor. Sponsored by the National Institutes of Health (NIH), HuBMAP is working to catalyze the development of a framework for mapping the human body at a level of glomeruli functional tissue units for the first time in history. Hoping to become one of the world's largest collaborative biological projects, HuBMAP aims to be an open map of the human body at the cellular level. This competition, ""Hacking the Kidney,"" starts by mapping the human kidney at single cell resolution. Your challenge is to detect functional tissue units (FTUs) across different tissue preparation pipelines. An FTU is defined as a three-dimensional block of cells centered around a capillary, such that each cell in this block is within diffusion distance from any other cell in the same block (de Bono, 2013). The goal of this competition is the implementation of a successful and robust glomeruli FTU detector. You will also have the opportunity to present your findings to a panel of judges for additional consideration. Successful submissions will construct the tools, resources, and cell atlases needed to determine how the relationships between cells can affect the health of an individual. Advancements in HuBMAP will accelerate the world's understanding of the relationships between cell and tissue organization and function and human health. These datasets and insights can be used by researchers in cell and tissue anatomy, pharmaceutical companies to develop therapies, or even parents to show their children the magnitude of the human body. This is a Code Competition. Refer to Code Requirements for details.`'",,HuBMAP - Hacking the Kidney,,Identify glomeruli in human kidney tissue images,Dice,hubmap-hacking-the-kidney 310,"'`In this competition, Kagglers will develop models capable of classifying mixed patterns of proteins in microscope images.
The Human Protein Atlas will use these models to build a tool integrated with their smart-microscopy system to identify a protein's location(s) from a high-throughput image. Proteins are the doers in the human cell, executing many functions that together enable life. Historically, classification of proteins has been limited to single patterns in one or a few cell types, but in order to fully understand the complexity of the human cell, models must classify mixed patterns across a range of different human cells. Images visualizing proteins in cells are commonly used for biomedical research, and these cells could hold the key for the next breakthrough in medicine. However, thanks to advances in high-throughput microscopy, these images are generated at a far greater pace than what can be manually evaluated. Therefore, the need is greater than ever for automating biomedical image analysis to accelerate the understanding of human cells and disease. Nature Methods has indicated interest in considering a paper discussing the outcome and approaches of the challenge. The Human Protein Atlas team would like to invite top performing teams to join as co-authors in the writing of this paper. Top performing teams will also be eligible to compete for the special prize. Additional information for both the special prize and co-authoring for Nature Methods will become available through the Discussion posts once the main competition is complete. Acknowledgements The Human Protein Atlas is a Sweden-based initiative aimed at mapping all human proteins in cells, tissues and organs. All the data in the knowledge resource is open access to allow anyone to pursue exploration of the human proteome. 
In a recent publication, the Human Protein Atlas team has demonstrated the promise of both citizen science and artificial intelligence approaches in describing the location of human proteins in images; however, current results have yet to approach expert-level annotations (Sullivan et al., Nature Biotechnology, Oct 2018).`'",,Human Protein Atlas Image Classification,,Classify subcellular protein patterns in human cells,MacroFScore,human-protein-atlas-image-classification 311,"'`After centuries of intense whaling, recovering whale populations still have a hard time adapting to warming oceans and struggle to compete every day with the industrial fishing industry for food. To aid whale conservation efforts, scientists use photo surveillance systems to monitor ocean activity. They use the shape of whales' tails and unique markings found in footage to identify what species of whale they're analyzing and meticulously log whale pod dynamics and movements. For the past 40 years, most of this work has been done manually by individual scientists, leaving a huge trove of data untapped and underutilized. In this competition, you're challenged to build an algorithm to identify individual whales in images. You'll analyze Happywhale's database of over 25,000 images, gathered from research institutions and public contributors. By contributing, you'll help to open rich fields of understanding for marine mammal population dynamics around the globe. Note, this competition is similar in nature to this competition with an expanded and updated dataset. We'd like to thank Happywhale for providing this data and problem. 
Happywhale is a platform that uses image processing algorithms to let anyone submit their whale photo and have it automatically identified.`'",,Humpback Whale Identification,,Can you identify a whale by its tail?,MAP@{K},humpback-whale-identification 312,"'`Whether it be in an arcade, on a phone, as an app, on a computer, or maybe stumbled upon in a web search, many of us have likely developed fond memories playing a version of Snake. It's addicting to control a slithering serpent and watch it grow along the grid until you make one wrong move. Then you have to try again because surely you won't make the same mistake twice! With Hungry Geese, Kaggle has taken this classic in the video game industry and put a multi-player, simulation spin on it. You will create an AI agent to play against others and survive the longest. You must make sure your goose doesn't starve or run into other geese; it's a good thing that geese love peppers, donuts, and pizza, which show up across the board. Extensive research exists in building Snake models using reinforcement learning, Q-learning, neural networks, and more (maybe you'll use Python?). Take your grid-based reinforcement learning knowledge to the next level with this exciting new challenge!`'",,Hungry Geese,,Don't. Stop. Eating.,hungry_geese,hungry-geese 313,"'`Objective: Apply the concepts covered in the Machine Learning lab sessions of the Artificial Intelligence course to solve a practical application, in the setting of a competition hosted on Kaggle, in which each student will try to obtain the best score.`'",,IA1920,inClass,Deliverable 1 of the Artificial Intelligence course 19/20,categorizationaccuracy,ia1920 314,"'`Imagine standing at the check-out counter at the grocery store with a long line behind you as the cashier not-so-quietly announces that your card has been declined. In this moment, you probably aren't thinking about the data science that determined your fate. 
Embarrassed, and certain you have the funds to cover everything needed for an epic nacho party for 50 of your closest friends, you try your card again. Same result. As you step aside and allow the cashier to tend to the next customer, you receive a text message from your bank. ""Press 1 if you really tried to spend $500 on cheddar cheese."" While perhaps cumbersome (and often embarrassing) in the moment, this fraud prevention system is actually saving consumers millions of dollars per year. Researchers from the IEEE Computational Intelligence Society (IEEE-CIS) want to improve this figure, while also improving the customer experience. With higher accuracy fraud detection, you can get on with your chips without the hassle. IEEE-CIS works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence. Today they're partnering with the world's leading payment service company, Vesta Corporation, seeking the best solutions for the fraud prevention industry, and now you are invited to join the challenge. In this competition, you'll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results. If successful, you'll improve the efficacy of fraudulent transaction alerts for millions of people around the world, helping hundreds of thousands of businesses reduce their fraud loss and increase their revenue. And of course, you will save party people just like you the hassle of false positives. Acknowledgements: Vesta Corporation provided the dataset for this competition. Vesta Corporation is the forerunner in guaranteed e-commerce payment solutions. 
Founded in 1995, Vesta pioneered the process of fully guaranteed card-not-present (CNP) payment transactions for the telecommunications industry. Since then, Vesta has firmly expanded data science and machine learning capabilities across the globe and solidified its position as the leader in guaranteed ecommerce payments. Today, Vesta guarantees more than $18B in transactions annually. Header Photo by Tim Evans on Unsplash`'",,IEEE-CIS Fraud Detection,,Can you detect fraud from customer transactions?,AUC,ieee-cis-fraud-detection 315,"'`Given training data (train.csv) containing NYC taxi trip information, we would like to predict trip duration using machine learning models (linear regression, random forest, boosting, SVR, neural network, etc). The training dataset (train.csv) contains a csv file with ride start and end zones, trip start time (local time), passenger count, trip distance in miles and trip duration in seconds. Each line is a trip and has the following information: row_id, VendorID, pickup_datetime, passenger_count, trip_distance, pickup_borough, dropoff_borough, pickup_zone, dropoff_zone, duration. The test dataset (test.csv) contains a csv file with ride start and end zones, trip start time (local time), passenger count and trip distance in miles, which are the same as in the training data. Trip duration is not provided. Each line is a trip and has the following information: row_id, VendorID, pickup_datetime, passenger_count, trip_distance, pickup_borough, dropoff_borough, pickup_zone, dropoff_zone. Your output file, titled submission.csv, should have two columns: row id (row_id) and estimated duration (duration) in seconds for each line in the test file, where the row_id column refers to the corresponding row in the test file for which you are making the prediction. 
It should also have the following header specified as the first line of the file: row_id, duration.`'",,IEOR 242 Spring 2020 HW 4,,Show your amazing ideas!,rmse,ieor-242-spring-2020-hw-4 316,"'`What did you eat today? Wondering if you are eating a healthy diet? Automatic food identification can assist towards food intake monitoring to maintain a healthy diet. Food classification is a challenging problem due to the large number of food categories, high visual similarity between different food categories, as well as the lack of datasets that are large enough for training deep models. In this competition, we extend last year's dataset to 251 fine-grained (prepared) food categories with 118,475 training images collected from the web. We provide human verified labels for both the validation set of 11,994 images and the test set of 28,377 images. The goal is to build a model to predict the fine-grained food-category label given an image. The main challenges are: Fine-grained Classes: The classes are fine-grained and visually similar. For example, the dataset has 15 different types of cakes, and 10 different types of pastas. Noisy Data: Since the training images are crawled from the web, they often include images of raw ingredients or processed and packaged food items. This is referred to as cross-domain noise. Further, due to the fine-grained nature of food-categories, a training image may either be incorrectly labeled into a visually similar class or be annotated with a single label despite having multiple food items. This competition is part of the fine-grained visual-categorization workshop (FGVC6 workshop) at CVPR 2019. There is a Github page for the competition here. For any queries you can start a discussion topic here or email us at ifood2019@gmail.com. Participants who make a submission that beats the sample submission can fill out this form to receive $150 in Google Cloud credits. 
We would like to thank SRI International and Google for support in data collection and labeling. The challenge is sponsored by SRI International.`'",,iFood - 2019 at FGVC6,inClass,Fine-grained classification of food images,meanbesterroratk,ifood-2019-at-fgvc6 317,'`A sample Competition to get us ready for Hackathon`',,IHSM Sample,inClass,Sample competition to get logistics done before the Hackathon,mape,ihsm-sample 318,"'`As shoppers move online, it would be a dream come true to have products in photos classified automatically. But, automatic product recognition is tough because for the same product, a picture can be taken in different lighting, angles, backgrounds, and levels of occlusion. Meanwhile different fine-grained categories may look very similar, for example, royal blue vs turquoise in color. Many of today's general-purpose recognition machines simply cannot perceive such subtle differences between photos, yet these differences could be important for shopping decisions. Tackling issues like this is why the Conference on Computer Vision and Pattern Recognition (CVPR) has put together a workshop specifically for data scientists focused on fine-grained visual categorization called the FGVC5 workshop. As part of this workshop, CVPR is partnering with Google, Wish, and Malong Technologies to challenge the data science community to help push the state of the art in automatic image classification. In this competition, FGVC workshop organizers with Wish and Malong Technologies challenge you to develop algorithms that will help with an important step towards automatic product detection to accurately assign attribute labels for fashion images. Individuals/Teams with top submissions will be invited to present their work live at the FGVC5 workshop. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. 
Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, quality, or topic will be addressed by them.`'",,iMaterialist Challenge (Fashion) at FGVC5,,Image classification of fashion products.,MeanFScoreBeta,imaterialist-challenge-(fashion)-at-fgvc5 319,"'`As shoppers move online, it'd be a dream come true to have products in photos classified automatically. But, automatic product recognition is challenging because for the same product, a picture can be taken in different lighting, angles, backgrounds, and levels of occlusion. Meanwhile different fine-grained categories may look very similar, for example, ball chair vs egg chair for furniture, or dutch oven vs french oven for cookware. Many of today's general-purpose recognition machines simply can't perceive such subtle differences between photos, yet these differences could be important for shopping decisions. Tackling issues like this is why the Conference on Computer Vision and Pattern Recognition (CVPR) has put together a workshop specifically for data scientists focused on fine-grained visual categorization called the FGVC5 workshop. As part of this workshop, CVPR is partnering with Google, Malong Technologies and Wish to challenge the data science community to help push the state of the art in automatic image classification. In this competition, FGVC5 workshop organizers and Malong Technologies challenge you to develop algorithms that will help with an important step towards automatic product recognition to accurately assign category labels for furniture and home goods images. Individuals/Teams with top submissions will be invited to present their work live at the FGVC5 workshop. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. 
Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, quality, or topic will be addressed by them.`'",,iMaterialist Challenge (Furniture) at FGVC5,,Image Classification of Furniture & Home Goods.,MeanBestErrorAtK,imaterialist-challenge-(furniture)-at-fgvc5 320,"'`Designers know what they are creating, but what, and how, do people really wear their products? What combinations of products are people using? In this competition, we challenge you to develop algorithms that will help with an important step towards automatic product detection to accurately assign segmentations and attribute labels for fashion images. Visual analysis of clothing is a topic that has received increasing attention in recent years. Being able to recognize apparel products and associated attributes from pictures could enhance the shopping experience for consumers, and increase work efficiency for fashion professionals. We present a new clothing dataset with the goal of introducing a novel fine-grained segmentation task by joining forces between the fashion and computer vision communities. The proposed task unifies both categorization and segmentation of rich and complete apparel attributes, an important step toward real-world applications. While early work in computer vision addressed related clothing recognition tasks, these are not designed with fashion insiders' needs in mind, possibly due to the research gap in fashion design and computer vision. To address this, we first propose a fashion taxonomy built by fashion experts, informed by product descriptions from the internet. To capture the complex structure of fashion objects and ambiguity in descriptions obtained from crawling the web, our standardized taxonomy contains 46 apparel objects (27 main apparel items and 19 apparel parts), and 92 related fine-grained attributes. 
Secondly, a total of 50K clothing images (10K with both segmentation and fine-grained attributes, 40K with apparel instance segmentation) in daily-life, celebrity events, and online shopping are labeled by both domain experts and crowd workers for fine-grained segmentation. Individuals/Teams with top submissions will be invited to present their work live at the FGVC6 workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) 2019. Check out the iMaterialist-Fashion Competition Github repo for the specifics of the dataset. Acknowledgments The iMat-Fashion Challenge 2019 is sponsored by Google AI, CVDF, Samasource and Fashionpedia.`'",,iMaterialist (Fashion) 2019 at FGVC6,,Fine-grained segmentation task for fashion and apparel,custom metric,imaterialist-(fashion)-2019-at-fgvc6 321,"'`Designers know what they are creating, but what, and how, do people really wear their products? What combinations of products are people using? In this competition, we challenge you to develop algorithms that will help with an important step towards automatic product detection to accurately assign segmentations and attribute labels for fashion images. Visual analysis of clothing is a topic that has received increasing attention in recent years. Being able to recognize apparel products and associated attributes from pictures could enhance the shopping experience for consumers, and increase work efficiency for fashion professionals. We present a clothing dataset with the goal of introducing a novel fine-grained segmentation task by joining forces between the fashion and computer vision communities. The proposed task unifies both categorization and segmentation of rich and complete apparel attributes, an important step toward real-world applications. While early work in computer vision addressed related clothing recognition tasks, these are not designed with fashion insiders' needs in mind, possibly due to the research gap in fashion design and computer vision. 
To address this, we first propose a fashion taxonomy built by fashion experts, informed by product descriptions from the internet. To capture the complex structure of fashion objects and ambiguity in descriptions obtained from crawling the web, our standardized taxonomy contains 46 apparel objects (27 main apparel items and 19 apparel parts), and 294 related fine-grained attributes. Secondly, a total of 50K clothing images (with both segmentation masks and fine-grained attributes) in daily-life, celebrity events, and online shopping are labeled by both domain experts and crowd workers for fine-grained segmentation. Individuals/Teams with top submissions will be invited to present their work live at the FGVC7 workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) 2020. Acknowledgments The iMat-Fashion Challenge 2020 is sponsored by Google AI, CVDF, Fashionpedia and Hearst Magazine.`'",,iMaterialist (Fashion) 2020 at FGVC7,,Fine-grained segmentation task for fashion and apparel,custom metric,imaterialist-(fashion)-2020-at-fgvc7 322,'`Nothing`',,IMDB Review,inClass,Homework 3,categorizationaccuracy,imdb-review 323,'`123`',,imdb_sentiment_classification,inClass,applied ai 2020,auc,imdb_sentiment_classification 324,"'`The Metropolitan Museum of Art in New York, also known as The Met, has a diverse collection of over 1.5M objects of which over 200K have been digitized with imagery. The online cataloguing information is generated by Subject Matter Experts (SME) and includes a wide range of data. These include, but are not limited to: multiple object classifications, artist, title, period, date, medium, culture, size, provenance, geographic location, and other related museum objects within The Met's collection. While the SME-generated annotations describe the object from an art history perspective, they can also be indirect in describing finer-grained attributes from the museum-goer's understanding. 
Adding fine-grained attributes to aid in the visual understanding of the museum objects will enable searching for visually related objects. About This is an FGVCx competition hosted as part of the FGVC6 workshop at CVPR 2019. View the github page for more details. This is a Kernels-only competition. Refer to Kernels Requirements for details.`'",,iMet Collection 2019 - FGVC6,,Recognize artwork attributes from The Metropolitan Museum of Art,MeanFScoreBeta,imet-collection-2019-fgvc6 325,"'`The Metropolitan Museum of Art in New York, also known as The Met, has a diverse collection of over 1.5M objects of which over 200K have been digitized with imagery. Can you help find the significant attributes to identify a specific work of art? Help advance this research in this notebook competition. The online cataloguing information is generated by subject matter experts and includes a wide range of data. These include, but are not limited to: multiple object classifications, artist, title, period, date, medium, culture, size, provenance, geographic location, and other related museum objects within The Met's collection. While the annotations describe the object from an art history perspective, they can also be indirect in describing finer-grained attributes for the museum-goer's understanding. Adding fine-grained attributes to aid in the visual understanding of the museum objects will enable searching for visually related objects. This is a Code Competition. Refer to Code Requirements for details. About This is an FGVCx competition hosted as part of the FGVC7 workshop at CVPR 2020.`'",,iMet Collection 2020 - FGVC7,,Recognize artwork attributes from The Metropolitan Museum of Art,MeanFScoreBeta,imet-collection-2020-fgvc7 326,"'`As part of the FGVC6 workshop at CVPR 2019, we are conducting the iNat Challenge 2019 large scale species classification competition, sponsored by Microsoft. 
It is estimated that the natural world contains several million species of plants and animals. Without expert knowledge, many of these species are extremely difficult to accurately classify due to their visual similarity. The goal of this competition is to push the state of the art in automatic image classification for real world data that features a large number of fine-grained categories. Previous versions of the challenge have focused on classifying large numbers of species. This year features a smaller number of highly similar categories captured in a wide variety of situations, from all over the world. In total, the iNat Challenge 2019 dataset contains 1,010 species, with a combined training and validation set of 268,243 images that have been collected and verified by multiple users from iNaturalist. Teams with top submissions, at the discretion of the workshop organizers, will be invited to present their work at the FGVC6 workshop. Participants who make a submission that beats the sample submission can fill out this form to receive $150 in Google Cloud credits. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, quality, or topic will be addressed by them.`'",,iNaturalist 2019 at FGVC6,,Fine-grained classification spanning a thousand species,MeanBestErrorAtK,inaturalist-2019-at-fgvc6 327,"'`With so much diversity, accurately classifying animals and plants is a tough challenge. Check out the photos below. Alpaca or Llama? Donkey or mule? Roses or kale? It's estimated that our planet contains several million species of plants and animals, many of which look really similar to each other. Because of this, a lot of species in the natural world are too hard to classify without an expert. 
As part of the FGVC4 workshop at CVPR 2017 we are conducting the iNat Challenge 2017 large scale species classification competition, sponsored by Google. It is estimated that the natural world contains several million species of plants and animals. Without expert knowledge, many of these species are extremely difficult to accurately classify due to their visual similarity. The goal of this competition is to push the state of the art in automatic image classification for real world data that features fine-grained categories, big class imbalances, and large numbers of classes. The iNat Challenge 2017 dataset contains 5,089 species, with a combined training and validation set of 675,000 images that have been collected and verified by multiple users from inaturalist.org. The dataset features many visually similar species, captured in a wide variety of situations, from all over the world. Example images, along with their unique GBIF ID numbers (where available), can be viewed here. Teams with top submissions, at the discretion of the workshop organizers, will be invited to present their work at the FGVC4 workshop. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, quality, or topic will be addressed by them.`'",,iNaturalist Challenge at FGVC 2017,research,"Fine-grained classification challenge spanning 5,000 species.",MeanBestErrorAtK,inaturalist-challenge-at-fgvc-2017 328,"'`Making products that work for people all over the globe is an important value at Google AI. In the field of classification, this means developing models that work well for regions all over the world. Today, the dataset a model is trained on greatly dictates the performance of that model. 
A system trained on a dataset that doesn't represent a broad range of localities could perform worse on images drawn from geographic regions underrepresented in the training data. Google and the industry at large are working to create more diverse & representative datasets. But it is also important for the field to make progress in understanding how to build models when the data available may not cover all audiences a model is meant to reach. Google AI is challenging Kagglers to develop models that are robust to blind spots that might exist in a data set, and to create image recognition systems that can perform well on test images drawn from different geographic distributions than the ones they were trained on. By finding ways to teach image classifiers to generalize to new geographic and cultural contexts, we hope the community will make even more progress in inclusive machine learning that benefits everyone, everywhere. Note: This competition is run in two stages. Refer to the FAQ for an explanation of how this works & the Timeline for specific dates. This competition is a part of the NIPS 2018 competition track. Winners will be invited to attend and present their solutions at the workshop. Shankar et al. ""No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World"" NIPS 2017 Workshop on Machine Learning for the Developing World`'",,Inclusive Images Challenge,,Stress test image classifiers across new geographic distributions,MeanFScoreBeta,inclusive-images-challenge 329,"'`This data set contains 416 liver patient records and 167 non-liver patient records collected from North East of Andhra Pradesh, India. The ""Disease"" column is a class label used to divide groups into liver patient (liver disease) or not (no disease). Any patient whose age exceeded 89 is listed as being of age ""90"". 
Columns: Age of the patient, Gender of the patient, Total Bilirubin, Direct Bilirubin, Alkaline Phosphotase, Alamine Aminotransferase, Aspartate Aminotransferase, Total Protiens, Albumin, Albumin and Globulin Ratio, Disease: field used to split the data into two sets (patient with liver disease, or no disease) Acknowledgements This dataset was downloaded from the UCI ML Repository: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.`'",,Indian Liver Patient Record,inClass,"Predict if a patient has liver disease, or no disease",categorizationaccuracy,indian-liver-patient-record 330,"'`Your smartphone goes everywhere with you, whether driving to the grocery store or shopping for holiday gifts. With your permission, apps can use your location to provide contextual information. You might get driving directions, find a store, or receive alerts for nearby promotions. These handy features are enabled by GPS, which requires outdoor exposure for the best accuracy. Yet, there are many times when you're inside large structures, such as a shopping mall or event center. Accurate indoor positioning, based on public sensors and user permission, allows for a great location-based experience even when you aren't outside. Current positioning solutions have poor accuracy, particularly in multi-level buildings, or generalize poorly to small datasets. Additionally, GPS was built for a time before smartphones. Today's use cases often require more granularity than is typically available indoors. In this competition, your task is to predict the indoor position of smartphones based on real-time sensor data, provided by indoor positioning technology company XYZ10 in partnership with Microsoft Research. You'll locate devices using active localization data, which is made available with the cooperation of the user. Unlike passive localization methods (e.g. 
radar, camera), the data provided for this competition requires explicit user permission. You'll work with a dataset of nearly 30,000 traces from over 200 buildings. If successful, you'll contribute to research with broad-reaching possibilities, including industries like manufacturing, retail, and autonomous devices. With more accurate positioning, existing location-based apps could even be improved. Perhaps you'll even see the benefits yourself the next time you hit the mall. Acknowledgments XYZ10 is a rising indoor positioning technology company in China. Since 2017, XYZ10 has been accumulating a privacy-sensitive indoor location dataset of WiFi, geomagnetic, and Bluetooth signatures with ground truths from nearly 1,000 buildings. Microsoft Research is the research subsidiary of Microsoft. Its goal is to advance state-of-the-art computing and solve difficult world problems through technological innovation in collaboration with academic, government, and industry researchers.`'",,Indoor Location & Navigation,,Identify the position of a smartphone in a shopping mall,IndoorLocalization,indoor-location-&-navigation 331,"'`Whether you shop from meticulously planned grocery lists or let whimsy guide your grazing, our unique food rituals define who we are. Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After selecting products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you. Instacart's data science team plays a big part in providing this delightful shopping experience. Currently they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. Recently, Instacart open sourced this data - see their blog post on 3 Million Instacart Orders, Open Sourced. 
In this competition, Instacart is challenging the Kaggle community to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user's next order. They're not only looking for the best model; Instacart's also looking for machine learning engineers to grow their team. Winners of this competition will receive both a cash prize and a fast track through the recruiting process. For more information about exciting opportunities at Instacart, check out their careers page here or e-mail their recruiting team directly at ml.jobs@instacart.com.`'",,Instacart Market Basket Analysis,,Which products will an Instacart consumer purchase again?,MeanFScore,instacart-market-basket-analysis 332,"'`Welcome to Instant (well, almost) Gratification! In 2015, Kaggle introduced Kernels as a resource to competition participants. It was a controversial decision to add a code-sharing tool to a competitive coding space. We thought it was important to make Kaggle more than a place where competitions are solved behind closed digital doors. Since then, Kernels has grown from its infancy--essentially a blinking cursor in a docker container--into its teenage years. We now have more compute, longer runtimes, better datasets, GPUs, and an improved interface. We have iterated and tested several Kernels-only (KO) competition formats with a true holdout test set, in particular deploying them when we would have otherwise substituted a two-stage competition. However, the experience of submitting to a Kernels-only competition has typically been asynchronous and imperfect; participants wait many days after a competition has concluded for their selected Kernels to be rerun on the holdout test dataset, the leaderboard updated, and the winners announced. This flow causes heartbreak to participants whose Kernels fail on the unseen test set, leaving them with no way to correct tiny errors that spoil months of hard work. 
Say Hello to Synchronous KO We're now pleased to announce general support for a synchronous Kernels-only format. When you submit from a Kernel, Kaggle will run the code against both the public test set and private test set in real time. This small-but-substantial tweak improves the experience for participants, the host, and Kaggle: With a truly withheld test set, we are practicing proper, rigorous machine learning. We will be able to offer more varieties of competitions and intend to run many fewer confusing two-stage competitions. You will be able to see if your code runs successfully on the withheld test set and have the leeway to intervene if it fails. We will run all submissions against the private data, not just selected ones. Participants will get the complete and familiar public/private scores available in a traditional competition. The final leaderboard can be released at the end of the competition, without the delay of rerunning Kernels. This competition is a low-stakes, trial-run introduction to our new synchronous KO implementation. We want to test that the process goes smoothly and gather feedback on your experiences. While it may feel like a normal KO competition, there are complicated new mechanics in play, such as the selection logic of Kernels that are still running when the deadline passes. Since the competition also presents an authentic machine learning problem, it will also award Kaggle medals and points. Have fun, good luck, and welcome to the world of synchronous Kernels competitions!`'",,Instant Gratification,,A synchronous Kernels-only competition,AUC,instant-gratification 333,"'`7. You read that correctly. That's the start to a real integer sequence, the powers of primes. Want something easier? How about the next number in 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55? If you answered 89, you may enjoy this challenge. Your computer may find it considerably less enjoyable. 
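The Fibonacci teaser above (0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55 → 89) can be reproduced by one naive baseline for next-term guessing: a brute-force search for a short integer linear recurrence. This is a hypothetical sketch of such a baseline, not a method endorsed by the competition:

```python
from itertools import product

def next_term(seq, max_order=2, coeff_range=range(-3, 4)):
    """Search for a recurrence a(n) = c1*a(n-1) + ... + ck*a(n-k) with small
    integer coefficients; if one fits the whole sequence, predict the next term."""
    for k in range(1, max_order + 1):
        for coeffs in product(coeff_range, repeat=k):
            if all(
                seq[n] == sum(c * seq[n - i - 1] for i, c in enumerate(coeffs))
                for n in range(k, len(seq))
            ):
                # Apply the recurrence once more to extrapolate.
                return sum(c * seq[len(seq) - i - 1] for i, c in enumerate(coeffs))
    return None  # no short recurrence found -- most OEIS sequences land here
```

For the teaser sequence this recovers a(n) = a(n-1) + a(n-2) and returns 34 + 55 = 89; most OEIS sequences defeat such a search, which is exactly why the challenge is "anything but basic".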
The On-Line Encyclopedia of Integer Sequences is a 50+ year effort by mathematicians the world over to catalog sequences of integers. If it has a pattern, it's probably in the OEIS, and probably described with amazing detail. This competition challenges you to create a machine learning algorithm capable of guessing the next number in an integer sequence. While this sounds like pattern recognition in its most basic form, a quick look at the data will convince you this is anything but basic! Acknowledgments Kaggle is hosting this competition for the data science community to use for fun and education. We thank the OEIS and its contributors for cataloging this data.`'",,Integer Sequence Learning,,"1, 2, 3, 4, 5, 7?!",CategorizationAccuracy,integer-sequence-learning 334,"'`Cervical cancer is so easy to prevent if caught in its pre-cancerous stage that every woman should have access to effective, life-saving treatment no matter where they live. Today, women worldwide in low-resource settings are benefiting from programs where cancer is identified and treated in a single visit. However, due in part to lacking expertise in the field, one of the greatest challenges of these cervical cancer screen and treat programs is determining the appropriate method of treatment, which can vary depending on patients' physiological differences. Especially in rural parts of the world, many women at high risk for cervical cancer are receiving treatment that will not work for them due to the position of their cervix. This is a tragedy: health providers are able to identify high risk patients, but may not have the skills to reliably discern which treatment will prevent cancer in these women. Even worse, applying the wrong treatment has a high cost. A treatment which works effectively for one woman may obscure future cancerous growth in another woman, greatly increasing health risks. 
Currently, MobileODT offers a Quality Assurance workflow to support remote supervision which helps healthcare providers make better treatment decisions in rural settings. However, their workflow would be greatly improved given the ability to make real-time determinations about patients' treatment eligibility based on cervix type. In this competition, Intel is partnering with MobileODT to challenge Kagglers to develop an algorithm which accurately identifies a woman's cervix type based on images. Doing so will prevent ineffectual treatments and allow healthcare providers to give proper referral for cases that require more advanced treatment. Competition Partner MobileODT has developed and sells the Enhanced Visual Assessment (EVA) System, a digital toolkit for health care workers of every level to provide expert services to patients, anchored at the point-of-care by an FDA-approved, intelligent, mobile-phone based medical device. Combining the algorithmic power of biomedical optics with the computational capabilities and connectivity of mobile phones, MobileODT's connected, intelligent medical systems can be used everywhere, under nearly any conditions. MobileODT's first product, the FDA approved EVA System for colposcopy, is in use by health providers in 31 hospital systems across the US, and in 22 countries, to better screen and treat women for cervical cancer and to conduct forensic colposcopy.`'",,Intel & MobileODT Cervical Cancer Screening,,Which cancer treatment will be most effective?,MulticlassLoss,intel-&-mobileodt-cervical-cancer-screening 335,"'`Cars4U is an on-demand car aggregator and rental service operating in New York. Under its purview, the company has cars which are either owned by the company itself or are listed on the platform by other car owners (Airbnb for Cars). The company has recently raised a $10 million Series A funding round and is looking to expand its services. 
In order to do so, the company is looking to strategically position its cars in various parts of the city by predicting demand. Cars4U is looking for talented data scientists such as you to help them with predicting the demand for their cars. The company has provided a dataset detailing their historical trends and is now looking for help to predict the demand.`'",,IML - Hackathon,,In class Hackathon,rmse,iml-hackathon 336,"'`Tangles of kudzu overwhelm trees in Georgia while cane toads threaten habitats in over a dozen countries worldwide. These are just two invasive species of many which can have damaging effects on the environment, the economy, and even human health. Despite widespread impact, efforts to track the location and spread of invasive species are so costly that they're difficult to undertake at scale. Currently, ecosystem and plant distribution monitoring depends on expert knowledge. Trained scientists visit designated areas and take note of the species inhabiting them. Using such a highly qualified workforce is expensive, time inefficient, and insufficient since humans cannot cover large areas when sampling. Because scientists cannot sample a large quantity of areas, some machine learning algorithms are used in order to predict the presence or absence of invasive species in areas that have not been sampled. The accuracy of this approach is far from optimal, but still contributes to approaches to solving ecological problems. In this playground competition, Kagglers are challenged to develop algorithms to more accurately identify whether images of forests and foliage contain invasive hydrangea or not. Techniques from computer vision alongside other current technologies like aerial imaging can make invasive species monitoring cheaper, faster, and more reliable. 
Acknowledgments Data providers: Christian Requena Mesa, Thore Engel, Amrita Menon, Emma Bradley.`'",,Invasive Species Monitoring,,Identify images of invasive hydrangea,AUC,invasive-species-monitoring 337,"'`How come some people already have good accuracy on day one? In this competition, besides the recruitment participants, there are also random people and senior members casually joining to test out the Kaggle competition. So don't worry! However, because this Kaggle competition is also used to measure the abilities of senior members, senior members will only be able to submit during the last 2 hours before the deadline. Hello! As in previous years, the Data Science SIG runs its junior member selection using the Kaggle platform. For those of you who don't know yet, Kaggle is a platform for hosting Data Science competitions, sharing datasets, and other cool things. Kaggle is a great fit for anyone who wants to learn Data Science! What is Data Science? Data science is a field of study concerned with processing data; it combines several disciplines, from computer science and mathematics to many other fields of knowledge. Data science itself is very broad, ranging from analytics that provide insight and are commonly used by data-driven companies to make decisions, to building predictive models that help people in everyday activities, such as speech recognition, recommender systems, and so on. Why is Data Science important? This era of rapidly developing technology has driven growth in the amount of available data. This abundance of data must be put to good use, and here the role of the data scientist becomes crucial. What is the Ristek Data Science SIG? 
The Data Science SIG is one of the special interest groups within Ristek, and this SIG studies Data Science end to end, from data collection, data analysis, and data cleaning to feature engineering and building predictive models that can solve a given problem. What are the assessment criteria for recruitment? There is no exact formula for scoring this competition. However, as in Data Science/Data Analytics/Data Mining competitions in general, the total grade is 55% from the notebook and 45% from the leaderboard score. This means that a participant's understanding and explanation in the notebook matter slightly more than the score. Participants cannot win if they do not understand what they are doing. How is the total leaderboard score computed? The leaderboard has a public score and a private score, and the data split used for this competition is 40% public, 60% private. A participant's total score is 40% * publicscore + 60% * privatescore. What is assessed in the notebook/PDF explanation? Data preprocessing Exploratory Data Analysis Feature Engineering Modelling Validation What if I have questions about this selection? If they concern Kaggle, please open a discussion. If not, you can reach us by email at aabccd021@gmail.com Is a Data Science competition a fortune-telling contest? Yes, but you can't tell fortunes if you're not a fortune teller. Are Data Scientists fortune tellers? Yes Premise for the problem in this competition Mr. Root Mean Square Error (RMSE) wants to build a boarding-house business in Kutek. He wants to know how much capital he needs to build boarding houses with adequate facilities and various other good qualities. He therefore needs your help to predict boarding-house prices based on the various considerations at hand. As kind-hearted friends who save diligently, you want to help Mr. RMSE predict boarding-house prices that he can factor into his business plans. 
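The stated leaderboard blend is simple enough to express directly (a sketch of the 40/60 weighting described above; the function name is ours, not part of the competition):

```python
def total_score(public_score, private_score):
    # Leaderboard blend as stated: 40% public score + 60% private score,
    # matching the 40%/60% public/private data split.
    return 0.40 * public_score + 0.60 * private_score
```

For example, a participant with a 0.5 public score and a 1.0 private score ends up with a 0.8 total.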
References https://www.datarobot.com/wiki/data-science/ https://blog.microfocus.com/how-much-data-is-created-on-the-internet-each-day/`'",,Ristik Diti Sciinci Ipin Ricriitmint 2020,inClass,Ristek Data Science Open Recruitment 2020,rootmeansquarepercentageerror,ristik-diti-sciinci-ipin-ricriitmint-2020 338,"'`This is the home page of the competition. You don't need a subtitle here. This is a subtitle Acknowledgements iris dataset demo.`'",,iris dataset demo for AIA student,inClass,iris dataset demo for AIA student,categorizationaccuracy,iris-dataset-demo-for-aia-student 339,"'`Buy low, sell high. It sounds so easy. In reality, trading for profit has always been a difficult problem to solve, even more so in today's fast-moving and complex financial markets. Electronic trading allows for thousands of transactions to occur within a fraction of a second, resulting in nearly unlimited opportunities to potentially find and take advantage of price differences in real time. In a perfectly efficient market, buyers and sellers would have all the agency and information needed to make rational trading decisions. As a result, products would always remain at their fair values and never be undervalued or overpriced. However, financial markets are not perfectly efficient in the real world. Developing trading strategies to identify and take advantage of inefficiencies is challenging. Even if a strategy is profitable now, it may not be in the future, and market volatility makes it impossible to predict the profitability of any given trade with certainty. As a result, it can be hard to distinguish good luck from having made a good trading decision. In the first three months of this challenge, you will build your own quantitative trading model to maximize returns using market data from a major global stock exchange. Next, you'll test the predictiveness of your models against future market returns and receive feedback on the leaderboard. 
Your challenge will be to use the historical data, mathematical tools, and technological tools at your disposal to create a model that gets as close to certainty as possible. You will be presented with a number of potential trading opportunities, which your model must choose whether to accept or reject. In general, if one is able to generate a highly predictive model which selects the right trades to execute, they would also be playing an important role in sending the market signals that push prices closer to fair values. That is, a better model will mean the market will be more efficient going forward. However, developing good models will be challenging for many reasons, including a very low signal-to-noise ratio, potential redundancy, strong feature correlation, and difficulty of coming up with a proper mathematical formulation. Jane Street has spent decades developing their own trading models and machine learning solutions to identify profitable opportunities and quickly decide whether to execute trades. These models help Jane Street trade thousands of financial products each day across 200 trading venues around the world. Admittedly, this challenge far oversimplifies the depth of the quantitative problems Jane Streeters work on daily, and Jane Street is happy with the performance of its existing trading model for this particular question. However, there's nothing like a good puzzle, and this challenge will hopefully serve as a fun introduction to a type of data science problem that a Jane Streeter might tackle on a daily basis. Jane Street looks forward to seeing the new and creative approaches the Kaggle community will take to solve this trading challenge. This is a Code Competition. Refer to Code Requirements for details.`'",,Jane Street Market Prediction,,Test your model against future real market data,JaneStreetPnl,jane-street-market-prediction 340,"'`It only takes one toxic comment to sour an online discussion. 
The Conversation AI team, a research initiative founded by Jigsaw and Google, builds technology to protect voices in conversation. A main area of focus is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion. If these toxic contributions can be identified, we could have a safer, more collaborative internet. In the previous 2018 Toxic Comment Classification Challenge, Kagglers built multi-headed models to recognize toxicity and several subtypes of toxicity. In 2019, in the Unintended Bias in Toxicity Classification Challenge, you worked to build toxicity models that operate fairly across a diverse range of conversations. This year, we're taking advantage of Kaggle's new TPU support and challenging you to build multilingual models with English-only training data. Jigsaw's API, Perspective, serves toxicity models and others in a growing set of languages (see our documentation for the full list). Over the past year, the field has seen impressive multilingual capabilities from the latest model innovations, including few- and zero-shot learning. We're excited to learn whether these results ""translate"" (pun intended!) to toxicity classification. Your training data will be the English data provided for our previous two competitions and your test data will be Wikipedia talk page comments in several different languages. As our computing resources and modeling capabilities grow, so does our potential to support healthy conversations across the globe. Develop strategies to build effective multilingual models and you'll help Conversation AI and the entire industry realize that potential. Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive. 
To get started with TPUs: Read the TPU documentation one-pager Then jump right into the Getting Started Notebooks for this competition Quick note: a TPU is a network-connected accelerator and requires a couple extra lines in your code. Flipping the TPU switch in your notebook will not, by itself, accelerate your code.`'",,Jigsaw Multilingual Toxic Comment Classification,,Use TPUs to identify toxicity comments across multiple languages,AUC,jigsaw-multilingual-toxic-comment-classification 341,"'`This competition asks you to determine whether a loan will default, as well as the loss incurred if it does default. Unlike traditional finance-based approaches to this problem, where one distinguishes between good or bad counterparties in a binary way, we seek to anticipate and incorporate both the default and the severity of the losses that result. In doing so, we are building a bridge between traditional banking, where we are looking at reducing the consumption of economic capital, to an asset-management perspective, where we optimize on the risk to the financial investor. This competition is sponsored by researchers at Imperial College London.`'",tabular data,Loan Default Prediction - Imperial College London,research,Constructing an optimal portfolio of loans,MAE,loan-default-prediction-imperial-college-london 342,"'`Update: although the tournament is over, we're continuing our analysis under the predictions dataset page. Back for its third year, March Machine Learning Mania challenges data scientists to predict winners and losers of the men's 2016 NCAA basketball tournament. You're provided data covering three decades of historical NCAA games and freely encouraged to use other sources of data to gain a winning edge. In stage one of this two-stage competition, participants will build and test their models against the previous four tournaments. In the second stage, participants will predict the outcome of the 2016 tournament. 
You don't need to participate in the first stage to enter the second. The first stage exists to incentivize model building and provide a means to score predictions. The real competition is forecasting the 2016 results. Acknowledgments SAP is the presenting sponsor of March Machine Learning Mania 2016. Please see About the Sponsor to read more.`'",tabular data,March Machine Learning Mania 2016,featured,Predict the 2016 NCAA Basketball Tournament,LogLoss,march-machine-learning-mania-2016 343,"'`With an original Picasso carrying a 106 million dollar price tag, identifying an authentic work of art from a forgery is a high-stakes industry. While algorithms have gotten good at telling us if a still life is of a basket of apples or a sunflower bouquet, they aren't yet able to tell us with certainty if both paintings are by van Gogh. In this playground competition, we're challenging Kagglers to examine pairs of paintings and determine if they are by the same artist. This is an excellent opportunity to improve your computer vision skills and engage with a unique dataset of art. From the movement of brushstrokes to the use of light and dark, successful algorithms will likely incorporate many aspects of a painter's unique style. Resources neural algorithm How Do We See Art: An Eye-Tracker Study Acknowledgments Many of the images in this dataset were obtained from wikiart.org. Additional paintings were provided by artists whose contributions will be acknowledged at the close of the competition. This playground competition and its datasets were prepared by Small Yellow Duck (Kiri Nichol). This includes the design of the pairwise-evaluation scheme.`'",image data,Painter by Numbers,playground,Does every painter leave a fingerprint? ,AUC,painter-by-numbers 344,"'`A lot has been said during the past several years about how precision medicine and, more concretely, how genetic testing is going to disrupt the way diseases like cancer are treated. 
But this is only partially happening due to the huge amount of manual work still required. Memorial Sloan Kettering Cancer Center (MSKCC) launched this competition, accepted by the NIPS 2017 Competition Track, because we need your help to take personalized medicine to its full potential. Once sequenced, a cancer tumor can have thousands of genetic mutations. But the challenge is distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers). Currently this interpretation of genetic mutations is being done manually. This is a very time-consuming task where a clinical pathologist has to manually review and classify every single genetic mutation based on evidence from text-based clinical literature. For this competition MSKCC is making available an expert-annotated knowledge base where world-class researchers and oncologists have manually annotated thousands of mutations. We need your help to develop a Machine Learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, quality, or topic will be addressed by them.`'",text data,Personalized Medicine: Redefining Cancer Treatment,research,Predict the effect of Genetic Variants to enable Personalized Medicine,MulticlassLoss,personalized-medicine:-redefining-cancer-treatment 345,"'`Night at Cameo Illinois Tech ACM designed this event in order to: Help Illinois Tech students learn about recommender systems Connect Illinois Tech students with the Cameo team Explore potential applications of recommender systems to Cameo Cameo Learn more about Cameo and their talent on their website. 
Challenge Each team will develop a recommender system to suggest talent that users may be interested in based on talent that the user is known to like. Get Started Go to the Kernels tab and fork the starter notebook. Read Erick Torres' article about recommender systems. Explore the available functions in the redcarpet module. Use the quick reference to understand the data structures and helper functions. Photo: Sebastian Ervi on Unsplash`'",tabular data,Night at Cameo,inClass,Illinois Institute of Technology ACM + Cameo: Recommend talent that users might like!,map@{k},night-at-cameo 346,"'`Hi! It is boring to wash the dishes. Luckily, half of them are already clean. Train a classifier to determine the clean ones to save time for the new machine learning course ;) It is a few-shot learning competition. We have a dataset of 20 clean and 20 dirty plates in train and hundreds of plates in test. Good luck!`'",image data,Cleaned vs Dirty,inClass,Classify if a plate is cleaned or dirty?,categorizationaccuracy,cleaned-vs-dirty 347,"'`Who do you think hates traffic more - humans or self-driving cars? The position of nearby automobiles is a key question for autonomous vehicles and it's at the heart of our newest challenge. Self-driving cars have come a long way in recent years, but they're still not flawless. Consumers and lawmakers remain wary of adoption, in part because of doubts about vehicles' ability to accurately perceive objects in traffic. Baidu's Robotics and Autonomous Driving Lab (RAL), along with Peking University, hopes to close the gap once and for all with this challenge. They're providing Kagglers with more than 60,000 labeled 3D car instances from 5,277 real-world images, based on industry-grade CAD car models. Your challenge: develop an algorithm to estimate the absolute pose of vehicles (6 degrees of freedom) from a single image in a real-world traffic environment. Succeed and you'll help improve computer vision. 
That, in turn, will bring autonomous vehicles a big step closer to widespread adoption, so they can help reduce the environmental impact of our growing societies. Please cite the following paper when using the dataset: ApolloCar3D: A Large 3D Car Instance Understanding Benchmark for Autonomous Driving @inproceedings{song2019apollocar3d,
  title={Apollocar3d: A large 3d car instance understanding benchmark for autonomous driving},
  author={Song, Xibin and Wang, Peng and Zhou, Dingfu and Zhu, Rui and Guan, Chenye and Dai, Yuchao and Su, Hao and Li, Hongdong and Yang, Ruigang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={5452--5462},
  year={2019}
}`'",image data,Peking University/Baidu - Autonomous Driving,featured,Can you predict vehicle angle in different settings?,PKUAutoDrivingAP,peking-university/baidu-autonomous-driving 348,"'`Overview The goal is to build an algorithm to correctly classify tiles (given their additional metadata like patient age and view) into either opacity or no-opacity groups. Data The data consists of the metadata in a csv file called _all or _info (for the training and testing groups) as well as the images in a stacked-tiff file where the tile for a given sample can be found by using the slice index.`'",tabular data,Pneumonia Texture Analysis,inClass,Performing Texture Analysis on the RSNA Pneumonia Data,auc,pneumonia-texture-analysis 349,"'`MACD, RSI, CCI, DMI, ADL`'",tabular data,ML 4 Money,inClass,Predict exchange income,rmse,ml-4-money 350,"'`In the Machine-Learning track of the contest organized by Politecnico di Milano and Oracle Labs you will have the chance to play with multilabel vertex classification, and with many state-of-the-art vertex embedding algorithms. Multilabel classification means that we want to assign to each element in a dataset (in our case, to each vertex in a graph) a list of labels, extracted from a well-defined set. 
For example, you might want to give to a book a list of topics (narrative, sci-fi, fantasy, ). Or you could imagine having a graph where vertices are recipes, and recipes are connected if they have some ingredient in common: in this case, you could predict recipe tags, e.g. spicy or vegetarian. In this competition, you will work with a list of 24 graphs that represent protein-protein interactions (PPI), and assign to each protein a set of labels that represent its roles. Truth be told, what we really care about is not the specific domain of the dataset, but the ability to create powerful embeddings for each vertex in these graphs. You will study how to transform the information contained in a vertex (either considering just the graph topology or all its features) to a low-dimensional vector, which can then be fed to a classifier to obtain predictions about this vertex's labels. Vertex embeddings The study of vertex embeddings has really exploded in the recent past, thanks to the discoveries in the field of Deep Learning and to the abundance of graph data to play with (Google, Facebook, Amazon, ..., you can really make a long list here!). In the Documents page you'll get a list of very interesting papers on the topic, that you can use as a starting point for your experiments. For now, what you want to know is that the starting point of ""modern"" graph embedding algorithms is called DeepWalk, which creates embeddings by performing random walks on the graph. DeepWalk is available inside Pgx, the graph analysis framework developed by Oracle Labs, and was used to create the first baseline for our predictions. Your job is to improve this result, either by playing with DeepWalk parameters, by finding a better embedding algorithm or by creating one yourself (don't put limits to your ambitions!). 
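As a rough illustration of DeepWalk's first stage, here is a sketch of uniform random-walk generation on a toy adjacency-list graph (our own toy example, not Pgx's API; real DeepWalk then feeds these walks to a skip-gram model to learn the embeddings):

```python
import random

def random_walks(adj, num_walks=2, walk_length=4, seed=0):
    """Generate `num_walks` truncated random walks starting from every vertex.

    adj maps each vertex to the list of its neighbors.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(adj)
        rng.shuffle(nodes)  # DeepWalk shuffles the vertices before each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break  # dead end: stop the walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

Treating each walk as a "sentence" of vertices is what lets a word2vec-style model produce the low-dimensional vectors mentioned above.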
Important remarks The main goal of this contest is not to obtain a super amazing prediction accuracy (even though that's also nice to have), but to show that you have a solid grasp of how vertex embeddings work and that you can make good use of them. As such, we are gonna provide you with the classifiers that should be used to make the final predictions. Feel free to fine-tune them a bit to your liking, but don't spend time creating giant ensemble models or things like that, as that's not the skill we are looking for. The same applies to feature engineering or other things you might be used to doing if you have any experience with data science. Don't spend time hand-crafting features or looking online for more features for your proteins! Get in touch with us if you have doubts or questions :) Another thing to keep in mind is that you can easily find online pre-made embedding algorithms that will give you great results, and that are basically the state-of-the-art in this field. Some of them are even linked in the papers we have provided! You can use existing algorithms or even existing code, but you should be well aware of what you are doing. Try different algorithms and spend time fine-tuning existing models or making incremental changes to them, and focus on learning what you are doing, instead of simply aiming for the top spot in the leaderboard! One last important thing! The dataset you are using is freely available online, which means that you can easily cheat in many many different ways, e.g. by training your model on the validation set or by simply copy-pasting the right predictions. You will have to provide us with your code, so that we can check if your results are really your own work (farina del tuo sacco, as the Italian saying goes) or if there is something fishy going on! 
We decided to use this dataset as it's a very common benchmark in the literature, and it's easy to see how your results compare with the state of the art.`'",tabular data,Oracle Graph ML Contest at Polimi,inClass,Contest organized at Polimi with support from Oracle Labs. Create embeddings on a graph and classify its vertices!,meanfscore,oracle-graph-ml-contest-at-polimi 351,"'`Various attributes are given for the pricing of a house. Help us build an accurate model to predict the sale price of a house. Note: try to submit your output via a private kernel.`'",tabular data,Kharagpur Data Analytics Group,inClass,Predict the outcome and become a part of our team.,rmse,kharagpur-data-analytics-group 352,'`Homework 4: Pseudo-relevance Feedback`',text data,NTUST: Information Retrieval and Applications,inClass,Homework 2: BM25,map@{k},ntust:-information-retrieval-and-applications 353,"'`Virtual Hackathon Participate in a virtual hackathon for scholars of the Secure and Private AI Scholarship Challenge from Facebook, conducted by #sghackathonorgnizrs. Come join us for a fun-filled 5 days of coding and competing against each other. When is it? Hackathon starts => Saturday 00:01am GMT to Thursday 00:01am GMT Coding Time => Saturday 00:01am GMT to Wednesday 00:01am GMT. Committing Kernel => Wednesday 00:01am GMT to Thursday 00:01am GMT How to participate? Use this form to sign up. You can participate alone or as part of a team of up to 4 individuals. Only 1 member of the team needs to fill in the form. https://forms.gle/EXVwAntevyexEqYP8. Please join the #sg_hackathon-orgnizrs channel to get the announcements and ask questions. When will results be announced? 
On Friday. Acknowledgements We thank Udacity and Facebook for this opportunity. For more FAQs, please go to our GitHub page`'",text data,Hackathon Sentimento_v2,inClass,Sentiment Analysis with tweets,categorizationaccuracy,hackathon-sentimento_v2 354,"'`Homework Create a better model than your counterparts to predict likes on Instagram and get more points. Grades 1st place - 15 points 2nd place - 12 points 3rd place - 10 points 4th place - 9 points 5th place - 8 points 6th place - 7 points 7th place - 6 points 8th place - 5 points 9th place - 4 points 10th place - 3 points 11th place - 2 points 12th place - 1 point Best kernel - 10 points`'",tabular data,Python for Data science ITEA,inClass,ML Homework (Instagram likes prediction),smape,python-for-data-science-itea 355,"'`2020 `'",tabular data,kaggle18011884,,kaggle18011884,rmse,kaggle18011884 356,"'`Note: This is one of the two complementary competitions that together comprise the M5 forecasting challenge. Can you estimate, as precisely as possible, the point forecasts of the unit sales of various products sold in the USA by Walmart? If you are interested in estimating the uncertainty distribution of the realized values of the same series, be sure to check out its companion competition. How much camping gear will one store sell each month in a year? To the uninitiated, calculating sales at this level may seem as difficult as predicting the weather. Both types of forecasting rely on science and historical data. While a wrong weather forecast may result in you carrying around an umbrella on a sunny day, inaccurate business forecasts could result in actual or opportunity losses. In this competition, in addition to traditional forecasting methods you're also challenged to use machine learning to improve forecast accuracy. The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. 
It helps companies achieve accurate predictions, estimate the levels of uncertainty, avoid costly mistakes, and apply best forecasting practices. The MOFC is well known for its Makridakis Competitions, the first of which ran in the 1980s. In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world's largest company by revenue, to forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product-category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy. If successful, your work will continue to advance the theory and practice of forecasting. The methods used can be applied in various business areas, such as setting up appropriate inventory or service levels. Through its business support and training, the MOFC will help distribute the tools and knowledge so others can achieve more accurate and better calibrated forecasts, reduce waste and be able to appreciate uncertainty and its risk implications. Acknowledgements Additional thanks go to other partner organizations and prize sponsors, National Technical University of Athens (NTUA), INSEAD, Google, Uber and IIF.`'",tabular data,M5 Forecasting - Accuracy,featured,Estimate the unit sales of Walmart retail goods,M5_WRMSSE,m5-forecasting-accuracy 357,"'`Note: This is one of the two complementary competitions that together comprise the M5 forecasting challenge. Can you estimate, as precisely as possible, the uncertainty distribution of the unit sales of various products sold in the USA by Walmart? This specific competition is the first of its kind, opening up new directions for both academic research and how uncertainty could be assessed and used in organizations. 
If you are interested in providing point (accuracy) forecasts for the same series, be sure to check out its companion competition. How much camping gear will one store sell each month in a year? To the uninitiated, calculating sales at this level may seem as difficult as predicting the weather. Both types of forecasting rely on science and historical data. While a wrong weather forecast may result in you carrying around an umbrella on a sunny day, inaccurate business forecasts could result in actual or opportunity losses. In this competition, in addition to traditional forecasting methods you're also challenged to use machine learning to improve forecast accuracy. The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. It helps companies achieve accurate predictions, estimate the levels of uncertainty, avoid costly mistakes, and apply best forecasting practices. The MOFC is well known for its Makridakis Competitions, the first of which ran in the 1980s. In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world's largest company by revenue, to forecast daily sales for the next 28 days and to make uncertainty estimates for these forecasts. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product-category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy. If successful, your work will continue to advance the theory and practice of forecasting. The methods used can be applied in various business areas, such as setting up appropriate inventory or service levels. 
Through its business support and training, the MOFC will help distribute the tools and knowledge so others can achieve more accurate and better calibrated forecasts, reduce waste and be able to appreciate uncertainty and its risk implications. Acknowledgements Additional thanks go to other partner organizations and prize sponsors, National Technical University of Athens (NTUA), INSEAD, Google, Uber and IIF.`'",tabular data,M5 Forecasting - Uncertainty,featured, Estimate the uncertainty distribution of Walmart unit sales. ,WeightedRowwisePinballLoss,m5-forecasting-uncertainty 358,"'`In this competition, you will perform regression using any type of ML model. You will be trying to predict the number of COVID-19 cases in a U.S. county given demographic data. You can use deep learning frameworks and traditional ML libraries (like scikit-learn) for this competition. Demographic data was acquired from the United States Department of Agriculture Economic Research Service. COVID-19 cases were scraped from Google on 7/18/2020 and 7/19/2020. You can use any library for this competition.`'",tabular data,NMLO Contest 3 - Regression,inClass,TJML National Machine Learning Open Contest #3,rmse,nmlo-contest-3-regression 359,"'`Welcome to the knit-hack competition. This competition is mainly for students of Kamla Nehru Institute of Technology, Sultanpur. However, you can still participate in the competition; you just would not be eligible for prizes.`'",tabular data,KNIT_HACKS,inClass,This competition is open to all. But only KNIT students are eligible to win the cash prize.,macrofscore,knit_hacks 360,"'`Welcome to our third Kaggle competition! In this competition, you will create your own deep neural network to predict travel times of seismic waves.`'",tabular data,DL for exploration geophysics,inClass,Competition #1,rmse,dl-for-exploration-geophysics 361,"'`Introduction Computer vision has advanced considerably but is still challenged in matching the precision of human perception. 
Open Images is a collaborative release of ~9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships. This uniquely large and diverse dataset is designed to spur state-of-the-art advances in analyzing and understanding images. This year's Open Images V5 release enabled the second Open Images Challenge to include the following 3 tracks: Object detection track for detecting bounding boxes around object instances, relaunched from 2018. Visual relationship detection track for detecting pairs of objects in particular relations, also relaunched from 2018. Instance segmentation track for segmenting masks of objects in images, brand new for 2019. Google AI hopes that having a single dataset with unified annotations for image classification, object detection, visual relationship detection, and instance segmentation will stimulate progress towards genuine scene understanding. Instance Segmentation Track In this track of the Challenge, you are asked to provide segmentation masks of objects. This track's training set represents 2.1M segmentation masks for object instances in 300 categories, with a validation set containing an additional 23k masks. The train set masks were produced by our state-of-the-art interactive segmentation process, where professional human annotators iteratively correct the output of a segmentation neural network. The validation and test set masks have been annotated manually with a strong focus on quality. Example train set annotations. Left: Wuxi science park, 1995 by Gary Stevens. Right: Cat Cafe Shinjuku calico by Ari Helminen. Both images used under CC BY 2.0 license. The results of this Challenge will be presented at a workshop at the International Conference on Computer Vision. 
We are excited to partner with Open Images for this second year of competitions, including this brand new track!`'",image data,Open Images 2019 - Instance Segmentation,research,Outline segmentation masks of objects in images,OpenImagesObjDetectionSegmentationAP,open-images-2019-instance-segmentation 362,"'`Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet), is working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion). So far they've built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don't allow users to select which types of toxicity they're interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content). In this competition, you're challenged to build a multi-headed model that's capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective's current models. You'll be using a dataset of comments from Wikipedia's talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful. 
Disclaimer: the dataset for this competition contains text that may be considered profane, vulgar, or offensive.`'",,Toxic Comment Classification Challenge,,Identify and classify toxic online comments,MCAUC,toxic-comment-classification-challenge 363,"'`Can you help detect toxic comments and minimize unintended model bias? That's your challenge in this competition. The Conversation AI team, a research initiative founded by Jigsaw and Google (both part of Alphabet), builds technology to protect voices in conversation. A main area of focus is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion. Last year, in the Toxic Comment Classification Challenge, you built multi-headed models to recognize toxicity and several subtypes of toxicity. This year's competition is a related challenge: building toxicity models that operate fairly across a diverse range of conversations. Here's the background: When the Conversation AI team first built toxicity models, they found that the models incorrectly learned to associate the names of frequently attacked identities with toxicity. Models predicted a high likelihood of toxicity for comments containing those identities (e.g. ""gay""), even when those comments were not actually toxic (such as ""I am a gay woman""). This happens because training data was pulled from available sources where, unfortunately, certain identities are overwhelmingly referred to in offensive ways. Training a model from data with these imbalances risks simply mirroring those biases back to users. In this competition, you're challenged to build a model that recognizes toxicity and minimizes this type of unintended bias with respect to mentions of identities. You'll be using a dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias. 
Develop strategies to reduce unintended bias in machine learning models, and you'll help the Conversation AI team, and the entire industry, build models that work well for a wide range of conversations. Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive. Acknowledgments The Conversation AI team would like to thank Civil Comments for making this dataset available publicly and the Online Hate Index Research Project at D-Lab, University of California, Berkeley, whose labeling survey/instrument informed the dataset labeling. We'd also like to thank everyone who has contributed to Conversation AI's research, especially those who took part in our last competition, the success of which led to the creation of this challenge. This is a Kernels-only competition. Refer to Kernels Requirements for details.`'",,Jigsaw Unintended Bias in Toxicity Classification,,Detect toxicity across a diverse range of conversations,JigsawBiasAUC,jigsaw-unintended-bias-in-toxicity-classification 364,"'`Kaggle Startup Program: please apply. Successful models will incorporate some analysis of the impact of including different keywords or phrases, as well as making use of the structured data fields like location, hours or company. Some of the structured data shown (such as category) is 'inferred' by Adzuna's own processes, based on where an ad came from or its contents, and may not be ""correct"" but is representative of the real data. You will be provided with a training data set on which to build your model, which will include all variables including salary. A second data set will be used to provide feedback on the public leaderboard. 
After approximately 6 weeks, Kaggle will release to participants a final data set that does not include the salary field; participants will then be required to submit their salary predictions against each job for evaluation.`'",,Job Salary Prediction,featured,Predict the salary of any UK job ad based on its contents,MAE,job-salary-prediction 365,"'` , . , .`'",,Journey to Springfield,,Once Upon a Time in Springfield....,meanfscore,journey-to-springfield 366,"'`Missed the one hour Just the Basics tutorial competition? Didn't get to implement that method you had in mind? Too many coffee breaks have your brain in Beautiful Mind mode? This is the after-party competition. Same data. Same problem. More time! You have until the close of Strata to have fun with the problem. Competition Starts: approximately 12:30 PM PT (3:30 PM ET), 02/26/2013 Competition Ends: 5:00 PM PT (8:00 PM ET), 02/28/2013`'",,Just the Basics - Strata 2013 After-party,,"Live from Santa Clara, CA",AUC,just-the-basics-strata-2013-after-party 367,"'`Overview Welcome to Kaggle's third annual Machine Learning and Data Science Survey and our second-ever survey data challenge. You can read our executive summary here. This year, as in 2017 and 2018, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for three weeks in October, and after cleaning the data we finished with 19,717 responses! There's a lot to explore here. The results include raw numbers about who is working with data, what's happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset. 
Challenge This year Kaggle is launching the second annual Data Science Survey Challenge, where we will be awarding a prize pool of $30,000 to notebook authors who tell a rich story about a subset of the data science and machine learning community. In our third year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities contained within the survey. For that reason, we're inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world. The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A story could be defined any number of ways, and that's deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in master's programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about! Submissions will be evaluated on the following: Composition - Is there a clear narrative thread to the story that's articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations. Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time. 
Documentation - Are your code, notebook, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible. To be valid, a submission must be contained in one notebook, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid. How to Participate To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry. No submission is necessary for the Weekly Notebook Award. To be eligible, a notebook must be public and use the 2019 Data Science Survey as a data source. Submission deadline: 11:59PM UTC, December 2nd, 2019. Survey Methodology This survey received 19,717 usable respondents from 171 countries and territories. If a country or territory received fewer than 50 respondents, we grouped them into a group named Other for anonymity. We excluded respondents who were flagged by our survey system as Spam. Most of our respondents were found primarily through Kaggle channels, like our email list, discussion forums and social media channels. The survey was live from October 8th to October 28th. We allowed respondents to complete the survey at any time during that window. The median response time for those who participated in the survey was approximately 10 minutes. Not every question was shown to every respondent. You can learn more about the different segments we used in the survey_schema.csv file. In general, respondents with more experience were asked more questions and respondents with less experience were asked fewer questions. 
To protect the respondents' identity, the answers to multiple choice questions have been separated into a separate data file from the open-ended responses. We do not provide a key to match up the multiple choice and free form responses. Further, the free form responses have been randomized column-wise such that the responses that appear on the same row did not necessarily come from the same survey-taker. Multiple choice single response questions fit into individual columns whereas multiple choice multiple response questions were split into multiple columns. Text responses were encoded to protect user privacy and countries with fewer than 50 respondents were grouped into the category ""other"". Data has been released under a CC 2.0 license: https://creativecommons.org/licenses/by/2.0/`'",,2019 Kaggle Machine Learning & Data Science Survey,,The most comprehensive dataset available on the state of ML and data science,survey analysis,2019-kaggle-machine-learning-&-data-science-survey 368,"'`Welcome to Kaggle's annual Machine Learning and Data Science Survey competition! You can read our executive summary here. This year, as in 2017, 2018, and 2019, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for 3.5 weeks in October, and after cleaning the data we finished with 20,036 responses! There's a lot to explore here. The results include raw numbers about who is working with data, what's happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset. 
This year Kaggle is once again launching an annual Data Science Survey Challenge, where we will be awarding a prize pool of $30,000 to notebook authors who tell a rich story about a subset of the data science and machine learning community. In our fourth year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities contained within the survey. For that reason, we're inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world. The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A story could be defined any number of ways, and that's deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in master's programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about! Submissions will be evaluated on the following: Composition - Is there a clear narrative thread to the story that's articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations. Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time. 
Documentation - Are your code, notebook, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible. To be valid, a submission must be contained in one notebook, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid. How to Participate To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry. No submission is necessary for the Notebook Award. To be eligible, a notebook must be public and use the 2020 Data Science Survey as a data source. Submission deadline: 11:59PM UTC, January 6th, 2021.`'",,2020 Kaggle Machine Learning & Data Science Survey,,The most comprehensive dataset available on the state of ML and data science ,survey analysis,2020-kaggle-machine-learning-&-data-science-survey 369,"'`Identify users based on how they type the string ""united states"".`'",,Keystroke dynamics challenge 1,inClass,Identify users based on the way they type,categorizationaccuracy,keystroke-dynamics-challenge-1 370,"'` Welcome to the Killer Shrimp Invasion Challenge! Are you a data scientist interested in applying your knowledge to environmental challenges? A marine scientist interested in using Machine Learning in your work? Generally passionate about the ocean and keen to learn more about Machine Learning and Marine Science? Then this is the right challenge for you! What is this about? The goal of this challenge is to spur researchers around the world to build innovative machine learning solutions to help with the monitoring and prediction of invasive species. 
The challenge focuses on one specific species, the so-called ""Killer Shrimp"" (Dikerogammarus villosus), and its spread in the Baltic Sea. The results can be applied to other species as well. Invasive species have become a growing problem in the recent past, causing severe harm to marine ecosystems and those who depend on them [1, 2, 3]. The economic impact alone amounts to several billion dollars annually [4]. The Kaggle Killer Shrimp Invasion Challenge invites you to try your hand at tackling this global problem with data and machine learning! Through your submissions, you will not only build an algorithm, but also help protect the ocean. Plus, the winner will receive a monetary prize (150) and the opportunity to present their solution to ODF and its partners in June! Picture: NOAA Great Lakes Environmental Research Laboratory, 1030, published under CC BY-SA 2.0 How to get started? To get started, check out the ""Data"" tab above. To discuss your thoughts with peers or to ask questions, check the ""Discussion"" tab. For more details on the challenge, on invasive species, on ODF and anything else, check the sidebar on the left. This is all new to me - what do I do? If you are completely new to Kaggle competitions, take a few minutes to look at the Titanic example and this general guide on Kaggle itself to learn the workflow. If you're entirely new to using Python and to machine learning, check out the Python tutorial and the ML tutorial and this general guide on Kaggle itself. Who can join? Everybody! No matter your background, your level of expertise or your country, we encourage everyone to join. If you are alone and want to team up with others, make a post in the ""Discussion"" tab. Who's behind this Competition? The idea of using machine learning to predict invasive species is based on a project by Sweden-based Ocean Data Factory (ODF). 
ODF is a triple-helix consortium of academia and public and private organizations with the common goal of enabling data-driven innovation in the global digital blue economy and helping to solve environmental challenges. The data sets within this competition were compiled by ODF from open data portals such as EMODnet, Copernicus Marine and SMHI. Good luck! `'",,Killer Shrimp Invasion,,Predict the presence of the invasive species D. villosus in the Baltic Sea,auc,killer-shrimp-invasion 371,"'`The 11th ACM International Conference on Web Search and Data Mining (WSDM 2018) is challenging you to build an algorithm that predicts whether a subscription user will churn, using a donated dataset from KKBOX. WSDM (pronounced ""wisdom"") is one of the premier conferences on web-inspired research involving search and data mining. They're committed to publishing original, high quality papers and presentations, with an emphasis on practical but principled novel models. For a subscription business, accurately predicting churn is critical to long-term success. Even slight variations in churn can drastically affect profits. KKBOX is Asia's leading music streaming service, holding the world's most comprehensive Asia-Pop music library with over 30 million tracks. They offer a generous, unlimited version of their service to millions of people, supported by advertising and paid subscriptions. This delicate model is dependent on accurately predicting churn of their paid users. In this competition you're tasked to build an algorithm that predicts whether a user will churn after their subscription expires. Currently, the company uses survival analysis techniques to determine the residual membership lifetime for each subscriber. By adopting different methods, KKBOX anticipates they'll discover new insights into why users leave so they can be proactive in keeping users dancing. Winners will present their findings at the WSDM conference February 6-8, 2018 in Los Angeles, CA. 
For more information on the conference, click here.`'",,WSDM - KKBox's Churn Prediction Challenge,,Can you predict when subscribers will churn?,LogLoss,wsdm-kkboxs-churn-prediction-challenge 372,"'`The 11th ACM International Conference on Web Search and Data Mining (WSDM 2018) is challenging you to build a better music recommendation system using a donated dataset from KKBOX. WSDM (pronounced ""wisdom"") is one of the premier conferences on web-inspired research involving search and data mining. They're committed to publishing original, high quality papers and presentations, with an emphasis on practical but principled novel models. Not many years ago, it was inconceivable that the same person would listen to the Beatles, Vivaldi, and Lady Gaga on their morning commute. But the glory days of Radio DJs have passed, and musical gatekeepers have been replaced with personalizing algorithms and unlimited streaming services. While the public's now listening to all kinds of music, algorithms still struggle in key areas. Without enough historical data, how would an algorithm know if listeners will like a new song or a new artist? And how would it know what songs to recommend to brand new users? WSDM has challenged the Kaggle ML community to help solve these problems and build a better music recommendation system. The dataset is from KKBOX, Asia's leading music streaming service, holding the world's most comprehensive Asia-Pop music library with over 30 million tracks. They currently use a collaborative filtering based algorithm with matrix factorization and word embedding in their recommendation system but believe new techniques could lead to better results. Winners will present their findings at the conference February 6-8, 2018 in Los Angeles, CA. 
For more information on the conference, click here, and don't forget to check out the other KKBox/WSDM competition: KKBox Music Churn Prediction Challenge`'",,WSDM - KKBox's Music Recommendation Challenge,,Can you build the best music recommendation system?,AUC,wsdm-kkboxs-music-recommendation-challenge 373,"'`AI Club demo of a k-means model using SciPy and Matplotlib. 4/6/2018`'",,K Means AI Club Demo,inClass,Simple K means Implementation with minimal visualization,categorizationaccuracy,k-means-ai-club-demo 374,'`winner winner chicken dinner`',,KNM2019,inClass,image2lang,categorizationaccuracy,knm2019 375,"'`Kobe Bryant marked his retirement from the NBA by scoring 60 points in his final game as a Los Angeles Laker on Wednesday, April 12, 2016. Drafted into the NBA at the age of 17, Kobe earned the sport's highest accolades throughout his long career. Using 20 years of data on Kobe's swishes and misses, can you predict which shots will find the bottom of the net? This competition is well suited for practicing classification basics, feature engineering, and time series analysis. Practice got Kobe an eight-figure contract and 5 championship rings. What will it get you? Acknowledgements Kaggle is hosting this competition for the data science community to use for fun and education. For more data on Kobe and other NBA greats, visit stats.nba.com.`'",,Kobe Bryant Shot Selection,,Which shots did Kobe sink?,LogLoss,kobe-bryant-shot-selection 376,"'`Build a model to transcribe ancient Kuzushiji into contemporary Japanese characters Imagine the history contained in a thousand years of books. What stories are in those books? What knowledge can we learn from the world before our time? What was the weather like 500 years ago? What happened when Mt. Fuji erupted? How can one fold 100 cranes using only one piece of paper? The answers to these questions are in those books. 
Japan has millions of books and over a billion historical documents, such as personal letters or diaries, preserved nationwide. Most of them cannot be read by the majority of Japanese people living today because they were written in Kuzushiji. Even though Kuzushiji, a cursive writing style, was used in Japan for over a thousand years, there are very few fluent readers of Kuzushiji today (only 0.01% of modern Japanese natives). Due to the lack of available human resources, there has been a great deal of interest in using Machine Learning to automatically recognize these historical texts and transcribe them into modern Japanese characters. Nevertheless, several challenges in Kuzushiji recognition have made the performance of existing systems extremely poor. (More information in About Kuzushiji) This is where you come in. The hosts need help from machine learning experts to transcribe Kuzushiji into contemporary Japanese characters. With your help, the Center for Open Data in the Humanities (CODH) will be able to develop better algorithms for Kuzushiji recognition. The model will be not only a great contribution to the machine learning community, but also a great help in making millions of documents more accessible, leading to new discoveries in Japanese history and culture. Hosts Center for Open Data in the Humanities (CODH) conducts research and development to enhance access to humanities data using state-of-the-art technology in informatics and statistics. The National Institute of Japanese Literature (NIJL) is an institution that strives to serve researchers in the field of Japanese literature, as well as those working in various other humanities, by collecting in one location a vast store of materials related to Japanese literature gathered from all corners of the country. The National Institute of Informatics (NII) is Japan's only general academic research institution seeking to create future value in the new discipline of informatics. 
NII seeks to advance integrated research and development activities in information-related fields, including networking, software, and content. Official Collaborators Mikel Bober-Irizar (anokas), Kaggle Grandmaster, and Alex Lamb (MILA, Quebec Artificial Intelligence Institute)`'",,Kuzushiji Recognition,,Opening the door to a thousand years of Japanese culture,KNISTMicroF1,kuzushiji-recognition 377,"'`A competition for churn prediction at a telecommunications company. This competition is part of the Machine Learning course at Labdata-FIA. Your objective is to develop a model that provides the best possible accuracy. Don't limit yourself to that; also try to find the best explanatory variables and how they explain your model's predictions. About the Data The data is available for download on the Data tab. An example notebook demonstrating how to create and submit a solution on Kaggle is available: How to submit solutions on Kaggle?. Each row represents a customer and each column represents a piece of information about that customer. The data includes the following groups of variables: Churn: whether or not a customer canceled the contracted services Services each customer signed up for - phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies Information about the customer's account - how long they've been a customer, contract, payment method, paperless billing, monthly charges, and total charges Demographic information - gender, age range, and if they have partners and dependents For more information and possible insights, visit the notebooks page of the original dataset: Telco Customer Churn Dataset`'",,Labdata Churn Challenge 2020,,Try to predict whether a client will churn the telco service,categorizationaccuracy,labdata-churn-challenge-2020 378,"'`Did you ever go through your vacation photos and ask yourself: What is the name of this temple I visited in China? 
Who created this monument I saw in France? Landmark recognition can help! This technology can predict landmark labels directly from image pixels, to help people better understand and organize their photo collections. Today, a great obstacle to landmark recognition research is the lack of large annotated datasets. In this competition, we present the largest worldwide dataset to date, to foster progress in this problem. This competition challenges Kagglers to build models that recognize the correct landmark (if any) in a dataset of challenging test images. Many Kagglers are familiar with image classification challenges like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which aims to recognize 1K general object categories. Landmark recognition is a little different from that: it contains a much larger number of classes (there are more than 200K classes in this challenge), and the number of training examples per class may not be very large. Landmark recognition is challenging in its own way. This is the second edition of this challenge. Compared to the first edition, the new dataset is more comprehensive and diverse. See the Data tab for more in-depth discussion of the newly released dataset. This challenge is organized in conjunction with the Landmark Retrieval Challenge. In particular, note that the test set for both challenges is the same, to encourage participants to compete in both. We encourage participants to use the training data from the recognition challenge (either from this year's or last year's dataset) to develop models that could be useful for the retrieval challenge.`'",,Google Landmark Recognition 2019,,Label famous (and not-so-famous) landmarks in images,GoogleGlobalAP,google-landmark-recognition-2019 379,"'`Welcome to the third Landmark Recognition competition! This year, we have worked to set this up as a code competition and collected a new set of test images. 
Have you ever gone through your vacation photos and asked yourself: What was the name of that temple I visited in China? or Who created this monument I saw in France? Landmark recognition can help! This technology can predict landmark labels directly from image pixels, to help people better understand and organize their photo collections. This competition challenges Kagglers to build models that recognize the correct landmark (if any) in a dataset of challenging test images. Many Kagglers are familiar with image classification challenges like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which aims to recognize 1K general object categories. Landmark recognition is a little different from that: it contains a much larger number of classes (there are more than 81K classes in this challenge), and the number of training examples per class may not be very large. Landmark recognition is challenging in its own way. In the previous editions of this challenge (2018 and 2019), submissions were handled by uploading prediction files to the system. This year's competition is structured in a synchronous rerun format, where participants need to submit their Kaggle notebooks for scoring. This challenge is organized in conjunction with the Landmark Retrieval Challenge 2020, which was launched June 30, 2020. Both challenges are affiliated with the Instance-Level Recognition workshop in ECCV20. This is a Code Competition. Refer to Code Requirements for details.`'",,Google Landmark Recognition 2020,,Label famous (and not-so-famous) landmarks in images,GoogleGlobalAP,google-landmark-recognition-2020 380,"'`[UPDATE] 2019 challenge launched: https://kaggle.com/c/landmark-recognition-2019 Did you ever go through your vacation photos and ask yourself: What is the name of this temple I visited in China? Who created this monument I saw in France? Landmark recognition can help! 
This technology can predict landmark labels directly from image pixels, to help people better understand and organize their photo collections. Today, a great obstacle to landmark recognition research is the lack of large annotated datasets. In this competition, we present the largest worldwide dataset to date, to foster progress in this problem. This competition challenges Kagglers to build models that recognize the correct landmark (if any) in a dataset of challenging test images. Many Kagglers are familiar with image classification challenges like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which aims to recognize 1K general object categories. Landmark recognition is a little different from that: it contains a much larger number of classes (there are a total of 15K classes in this challenge), and the number of training examples per class may not be very large. Landmark recognition is challenging in its own way. This challenge is organized in conjunction with the Landmark Retrieval Challenge ( https://www.kaggle.com/c/landmark-retrieval-challenge ). In particular, note that the test set for both challenges is the same, to encourage participants to compete in both. We also encourage participants to use the training data from the recognition challenge to train models that could be useful for the retrieval challenge. Note, however, that there are no landmarks in common between the training/index sets of the two challenges.`'",,Google Landmark Recognition Challenge,,Label famous (and not-so-famous) landmarks in images,GoogleGlobalAP,google-landmark-recognition-challenge 381,"'`Image retrieval is a fundamental problem in computer vision: given a query image, can you find similar images in a large database? This is especially important for query images containing landmarks, which account for a large portion of what people like to photograph. 
In this competition, Kagglers are given query images and, for each query, are expected to retrieve all database images containing the same landmarks (if any). The competition will proceed in two phases: The 1st phase will use the same test and index sets as last year, while for phase 2 we will release a completely new dataset that contains 700K images with more than 100K unique landmarks. We hope that this release will accelerate progress in this important research problem. This challenge is organized in conjunction with the Landmark Recognition Challenge. In particular, note that the test set for both challenges is the same, to encourage participants to compete in both. We also encourage participants to use the training data from the recognition challenge (either from this year's or last year's dataset) to develop models that could be useful for the retrieval challenge.`'",,Google Landmark Retrieval 2019,,"Given an image, can you find all of the same landmarks in a dataset?",MAP@{K},google-landmark-retrieval-2019 382,"'`Welcome to the third Landmark Retrieval competition! This year, we have worked to set this up as a code competition and we have completely refreshed the test and index image sets. Image retrieval is a fundamental problem in computer vision: given a query image, can you find similar images in a large database? This is especially important for query images containing landmarks, which account for a large portion of what people like to photograph. In this competition, the developed models are expected to retrieve database images relevant to a given query image (i.e., the model should retrieve database images containing the same landmark as the query). This challenge is organized in conjunction with the Landmark Recognition Challenge 2020. Both challenges will be discussed at the Instance-Level Recognition workshop in ECCV20. In the previous editions of this challenge (2018 and 2019), submissions were handled by uploading prediction files to the system. 
This year's competition is structured in a representation learning format: rather than creating a submission file with retrieved images, you will create a model that extracts a feature embedding for the images and submit the model via Kaggle Notebooks. Kaggle will run your model on a held-out test set, perform a k-nearest-neighbors lookup, and score the resulting embedding quality with mean average precision. This is a Code Competition. Refer to Code Requirements for details.`'",,Google Landmark Retrieval 2020,,"Given an image, can you find all of the same landmarks in a dataset?",PostProcessorKernelDesc,google-landmark-retrieval-2020 383,"'`[UPDATE] 2019 challenge launched: https://kaggle.com/c/landmark-retrieval-2019 Image retrieval is a fundamental problem in computer vision: given a query image, can you find similar images in a large database? This is especially important for query images containing landmarks, which account for a large portion of what people like to photograph. In this competition, Kagglers are given query images and, for each query, are expected to retrieve all database images containing the same landmarks (if any). The new dataset is the largest worldwide dataset for image retrieval research, comprising more than a million images of 15K unique landmarks. We hope that this release will accelerate progress in this important research problem. This challenge is organized in conjunction with the Landmark Recognition Challenge (https://www.kaggle.com/c/landmark-recognition-challenge). In particular, note that the test set for both challenges is the same, to encourage participants to compete in both. We also encourage participants to use the training data from the recognition challenge to train models that could be useful for the retrieval challenge. 
Note, however, that there are no landmarks in common between the training/index sets of the two challenges.`'",,Google Landmark Retrieval Challenge,research,"Given an image, can you find all of the same landmarks in a dataset?",MAP@{K},google-landmark-retrieval-challenge 384,"'`There are estimated to be nearly half a million species of plant in the world. Classification of species has been historically problematic and often results in duplicate identifications. Automating plant recognition might have many applications. The objective of this playground competition is to use binary leaf images and extracted features, including shape, margin & texture, to accurately identify 99 species of plants. Leaves, due to their volume, prevalence, and unique characteristics, are an effective means of differentiating plant species. They also provide a fun introduction to applying techniques that involve image-based features. As a first step, try building a classifier that uses the provided pre-extracted features. Next, try creating a set of your own features. Finally, examine the errors you're making and see what you can do to improve. Acknowledgments Kaggle is hosting this competition for the data science community to use for fun and education. This dataset originates from leaf images collected by James Cope, Thibaut Beghin, Paolo Remagnino, & Sarah Barman of the Royal Botanic Gardens, Kew, UK. Charles Mallah, James Cope, James Orwell. Plant Leaf Classification Using Probabilistic Integration of Shape, Texture and Margin Features. Signal Processing, Pattern Recognition and Applications, in press. 2013. We thank the UCI machine learning repository for hosting the dataset.`'",,Leaf Classification,,Can you see the random forest for the leaves?,MulticlassLoss,leaf-classification 385,"'`A Fortune 100 company, Liberty Mutual Insurance has provided a wide range of insurance products and services designed to meet their customers' ever-changing needs for over 100 years. 
To ensure that Liberty Mutual's portfolio of home insurance policies aligns with their business goals, many newly insured properties receive a home inspection. These inspections review the condition of key attributes of the property, including things like the foundation, roof, windows and siding. The results of an inspection help Liberty Mutual determine if the property is one they want to insure. In this challenge, your task is to predict a transformed count of hazards or pre-existing damages using a dataset of property information. This will enable Liberty Mutual to more accurately identify high-risk homes that require additional examination to confirm their insurability. Liberty Mutual is interested in hiring predictive modelers like you to work on one of many growing analytics teams within our company. As a member of Liberty Mutual's advanced analytics community, you will have the opportunity to apply sophisticated, cutting-edge techniques, similar to those used in this competition, to large data sets in departments such as Actuarial, Product, Claims, Marketing, Distribution, Human Resources, and Finance. Click to view available positions. Because we seek to tap innovation both inside and outside the company, certain eligible Liberty Mutual employees are encouraged to participate in this challenge for development purposes. Refer to the competition rules for the full details.`'",,Liberty Mutual Group: Property Inspection Prediction,,Quantify property hazards before time of inspection,NormalizedGini,liberty-mutual-group:-property-inspection-prediction 386,"'`The Connectivity Map, a project within the Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Fund's Library of Integrated Network-Based Cellular Signatures (LINCS), present this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms. What is the Mechanism of Action (MoA) of a drug? 
And why is it important? In the past, scientists derived drugs from natural products or were inspired by traditional remedies. Very common drugs, such as paracetamol, known in the US as acetaminophen, were put into clinical use decades before the biological mechanisms driving their pharmacological activities were understood. Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease. In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action, or MoA for short. How do we determine the MoAs of a new drug? One approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as libraries of gene expression or cell viability patterns of drugs with known MoAs. In this competition, you will have access to a unique dataset that combines gene expression and cell viability data. The data is based on a new technology that measures simultaneously (within the same samples) human cells' responses to drugs in a pool of 100 different cell types (thus solving the problem of identifying, ex ante, which cell types are better suited for a given drug). In addition, you will have access to MoA annotations for more than 5,000 drugs in this dataset. As is customary, the dataset has been split into testing and training subsets. Hence, your task is to use the training dataset to develop an algorithm that automatically labels each case in the test set as one or more MoA classes. 
Note that since drugs can have multiple MoA annotations, the task is formally a multi-label classification problem. How to evaluate the accuracy of a solution? Based on the MoA annotations, the accuracy of solutions will be evaluated on the average value of the logarithmic loss function applied to each drug-MoA annotation pair. If successful, you'll help develop an algorithm to predict a compound's MoA given its cellular signature, thus helping scientists advance the drug discovery process. This is a Code Competition. Refer to Code Requirements for details.`'",,Mechanisms of Action (MoA) Prediction,,Can you improve the algorithm that classifies drugs based on their biological activity?,MeanColumnwiseLogLoss,mechanisms-of-action-(moa)-prediction 387,"'`Think you can use your data science skills to make big predictions at a submicroscopic level? Many diseases, including cancer, are believed to have a contributing factor in common. Ion channels are pore-forming proteins present in animals and plants. They encode learning and memory, help fight infections, enable pain signals, and stimulate muscle contraction. If scientists could better study ion channels, which may be possible with the aid of machine learning, it could have a far-reaching impact. When ion channels open, they pass electric currents. Existing methods of detecting these state changes are slow and laborious. Humans must supervise the analysis, which imparts considerable bias, in addition to being tedious. These difficulties limit the volume of ion channel current analysis that can be used in research. Scientists hope that technology could enable rapid automatic detection of ion channel current events in raw data. The University of Liverpool's Institute of Ageing and Chronic Disease is working to advance ion channel research. Their team of scientists has asked for your help. In this competition, you'll use ion channel data to better model automatic identification methods. 
If successful, you'll be able to detect individual ion channel events in noisy raw signals. The data is simulated and injected with real-world noise to emulate what scientists observe in laboratory experiments. Technology to analyze electrical data in cells has not changed significantly over the past 20 years. If we better understand ion channel activity, the research could impact many areas related to cell health and migration. From human diseases to how climate change affects plants, faster detection of ion channels could greatly accelerate solutions to major world problems. Acknowledgements: This would not be possible without the help of the Biotechnology and Biological Sciences Research Council (BBSRC).`'",,University of Liverpool - Ion Switching,,Identify the number of channels open at each time point,MacroFScore,university-of-liverpool-ion-switching 388,"'`Predict the winner of a match: the target blueWins is 1 or 0. https://youtu.be/sfKJywIEr2U https://www.kaggle.com/gyejr95/league-of-legends-challenger-ranked-games2020 https://www.kaggle.com/skyil7/data-processing-for-regression-tasks`'",,League of Legends Winner Prediction,inClass,2020.Spring.AI_termproject_19011484백지오,categorizationaccuracy,league-of-legends-winner-prediction 389,"'`Zero-day: A zero-day vulnerability is a computer-software vulnerability that is unknown to those who would be interested in mitigating it. Until the vulnerability is mitigated, hackers can exploit it to adversely affect computer programs, data, additional computers or a network. Zero-day malware cannot be caught by conventional malware detection systems, which depend heavily on the manual creation of signatures to detect malware files. Based on research done in Max Secure Software's laboratories, it was found that machine learning techniques could perform well in detecting zero-day vulnerabilities (focused on malware files), if trained properly on suitable models. 
We bring you the opportunity to work on data extracted from a statistical analysis conducted on a large set of malware and legitimate files, to build ML programs that can perform efficiently in detecting malware files. There is one catch, though: it is necessary for malware detection systems to ensure no legitimate files are predicted as malware, because deleting a legitimate file could lead to serious losses for users. Final evaluation of the candidates considered for the cash prize will be done manually by a jury; if your model predicts any legitimate files as malicious, a strong penalty will be imposed, which may lead you to lose the competition even if you rank well on Kaggle scoreboards. Ensure that the false positive predictions are as low as 0.001%. The test data for the competition has been updated; please download and use the updated data for submission. The penalty for false negative classifications will be relative. The final ranking on evaluation will look only at the number of false negatives. More false negatives will lead to a lower rank. By false negative here I mean the number of legitimate files considered as malicious. Thank you for being part of the competition. The final evaluation will start on the 24th of October 2018, and the complete schedule will be intimated on the 25th of October 2018. Please bear with us. For further details contact us at anandp@iitbhilai.ac.in`'",,Malware Detection,inClass,"Make your own Malware security system, in association with Meraz'18 malware security partner Max Secure Software",categorizationaccuracy,malware-detection 390,"'`At Kaggle HQ and in offices across the country, March is a month when bracketology is in bloom. Back by popular demand, our second annual March Machine Learning Mania competition pits you against the millions of sports fans and office-pool bandwagoners who are hoping to win big by correctly predicting the outcome of the men's NCAA basketball tournament. 
While the odds of forecasting a perfect bracket are astronomical, these odds are improved by the growing amount of data collected throughout the season, including player statistics, tournament seeds, geographical factors and social media. How well can machine learning and statistical techniques improve the forecast? Presented by HP Software's industry-leading Big Data group and the HP Haven Big Data platform, this competition will test how well predictions based on data stack up against a (jump) shot in the dark. This competition allows you to get creative with the datasets you use to create your model. We provide data covering three decades of historical games, but you're highly encouraged to pull in data from external sources. The 50+ REST APIs from HP IDOL OnDemand are a great way to get started augmenting the dataset. Developer accounts are free and include a free monthly quota! Begin by extracting trending topics and identifying entities from the IDOL OnDemand news dataset (accessed via the Query Text Index API) or by analyzing public sentiment about players and teams using data from your social media feed. In stage one of this two-stage competition, participants will build and test their models against the previous four tournaments. In the second stage, participants will predict the outcome of the 2015 tournament. You don't need to participate in the first stage to enter the second, but the first stage exists to incentivize model building and provide a means to score predictions. The real competition is forecasting the 2015 results, for which you'll predict winning percentages for the likelihood of each possible matchup, not just a traditional bracket. HP is sponsoring $15,000 in cash prizes for the winners. Please visit the FAQs for more information. Acknowledgements March Machine Learning Mania 2015 is presented by HP. 
Please see About the sponsor to read more.`'",,March Machine Learning Mania 2015,featured,Predict the 2015 NCAA Basketball Tournament,LogLoss,march-machine-learning-mania-2015 391,"'`In this assignment, you will label medical note samples into one of five clinical domains. Each medical note comes from exactly one of the following five clinical domains: Gastroenterology Neurology Orthopedic Radiology Urology The training and test datasets are not uniformly distributed across these domains -- i.e. some domains are represented more often than others in both training and test datasets. The training dataset consists of medical notes, one note per file. Notes have some structure, but mostly describe the clinical procedures administered to patients. Acknowledgements The dataset is derived from the medical transcription samples available at http://mtsamples.com/`'",,Medical Notes Classification,,Classify the medical notes to the medical specialty,categorizationaccuracy,medical-notes-classification 392,"'`It can be hard to know how much something's really worth. Small details can mean big differences in pricing. For example, one of these sweaters cost $335 and the other cost $9.99. Can you guess which one's which? Product pricing gets even harder at scale, considering just how many products are sold online. Clothing has strong seasonal pricing trends and is heavily influenced by brand names, while electronics have fluctuating prices based on product specs. Mercari, Japan's biggest community-powered shopping app, knows this problem deeply. They'd like to offer pricing suggestions to sellers, but this is tough because their sellers can put just about anything, or any bundle of things, on Mercari's marketplace. In this competition, Mercari's challenging you to build an algorithm that automatically suggests the right product prices. 
You'll be provided with user-inputted text descriptions of their products, including details like product category name, brand name, and item condition. Note that, because of the public nature of this data, this competition is a Kernels Only competition. In the second stage of the challenge, files will only be available through Kernels and you will not be able to modify your approach in response to new data. Read more details in the data tab and Kernels FAQ page.`'",,Mercari Price Suggestion Challenge,,Can you automatically suggest product prices to online sellers?,RMSLE,mercari-price-suggestion-challenge 393,"'`Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include, for example, the passenger safety cell with crumple zone, the airbag and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium car makers. Daimler's Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams. To ensure the safety and reliability of each and every unique car configuration before they hit the road, Daimler's engineers have developed a robust testing system. But, optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world's biggest manufacturers of premium cars, safety and efficiency are paramount on Daimler's production lines. In this competition, Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. 
Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler's standards.`'",,Mercedes-Benz Greener Manufacturing,,Can you cut the time a Mercedes-Benz spends on the test bench?,R2Score,mercedes-benz-greener-manufacturing 394,"'`The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways. With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security. As one part of their overall strategy for doing so, Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences. Can you help protect more than one billion machines from damage BEFORE it happens? Acknowledgements This competition is hosted by Microsoft, Windows Defender ATP Research, Northeastern University College of Computer and Information Science, and Georgia Tech Institute for Information Security & Privacy. 
Microsoft contacts Rob McCann (Robert.McCann@microsoft.com) Christian Seifert (chriseif@microsoft.com) Susan Higgs (Susan.Higgs@microsoft.com) Matt Duncan (Matthew.Duncan@microsoft.com) Northeastern University contact Mansour Ahmadi (m.ahmadi@northeastern.edu) Georgia Tech contacts Brendan Saltaformaggio (brendan@ece.gatech.edu) Taesoo Kim (taesoo@gatech.edu)`'",,Microsoft Malware Prediction,,Can you predict if a machine will soon be hit with malware?,AUC,microsoft-malware-prediction 395,"'` linear regression PM2.5 train set test set 20 train.csv: 20 test.csv: 10 240 9 PM2.5`'",,ML2020spring - hw1,inClass,Regression - PM2.5 Prediction,rmse,ml2020spring-hw1 396,"'`Fake currency notes are a huge issue in the banking system of any economy. To deal with them, we were provided with features extracted from images of genuine and fake currency notes. On the basis of the data provided, the notes were classified as Forged or Genuine. Given the dataset with the features in train.csv, establish a classifier which is able to correctly classify the forged (0) and genuine (1) labels for the currency notes.`'",,MLCC NSEC,inClass,MLCC NSEC Study Jam Contest,categorizationaccuracy,mlcc-nsec 397,"'`This competition consists of developing a convolutional network for image classification. It uses the CIFAR-100 dataset, which consists of 32x32-pixel color photographic images classified into 100 distinct classes. The dataset has a training set of 50,000 images (500 per class) and a test set of 10,000 images (100 per class). The objective is to explore different networks and hyperparameter configurations, using the training set to train the models. To apply cross-validation or other validation sampling techniques, you must use the training set. 
You will be shown the test images, but you will not have the corresponding class labels.`'",,CNN en CIFAR-100,,Image classification using a multilayer perceptron,categorizationaccuracy,cnn-en-cifar-100 398,"'`""There's a thin line between likably old-fashioned and fuddy-duddy, and The Count of Monte Cristo ... never quite settles on either side."" The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee [1]. In their work on sentiment treebanks, Socher et al. [2] used Amazon's Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus. This competition presents a chance to benchmark your sentiment-analysis ideas on the Rotten Tomatoes dataset. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others make this task very challenging. Kaggle is hosting this competition for the machine learning community to use for fun and practice. This competition was inspired by the work of Socher et al. [2]. We encourage participants to explore the (and dare we say, fantastic) website that accompanies the paper: http://nlp.stanford.edu/sentiment/ There you will find source code, a live demo, and even an online interface to help train the model. [1] B. Pang and L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pages 115-124. [2] Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Chris Manning, Andrew Ng and Chris Potts. 
Conference on Empirical Methods in Natural Language Processing (EMNLP 2013).`'",,Movie Review Sentiment Analysis (Kernels Only),,Classify the sentiment of sentences from the Rotten Tomatoes dataset,CategorizationAccuracy,movie-review-sentiment-analysis-(kernels-only) 399,"'`This competition, hosted by the uktML club, is an easy introduction to building predictive models with machine learning. Using the provided training data, you'll be given a 1000-character sample of a movie script and be asked to predict the genre the movie came from.`'",,Movies,inClass,Can you predict the genre of a movie from just 1000 characters of the script?,categorizationaccuracy,movies 400,"'`Start here if... You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition. Competition Description Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With ~30 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. Practice Skills Creative feature engineering Advanced regression techniques like random forest and gradient boosting Acknowledgments The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 
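As a minimal sketch of the gradient-boosting approach mentioned above (using synthetic stand-in data, not the actual Ames columns, and scoring with RMSLE since that is the competition metric):

```python
# Hedged sketch: gradient boosting on log1p(price), evaluated with RMSLE.
# The features here are randomly generated stand-ins, not real Ames variables.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                 # stand-in explanatory variables
y = np.exp(X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0)
model.fit(X_tr, np.log1p(y_tr))               # fit on log1p of the target
pred = np.expm1(model.predict(X_te))          # invert the transform

rmsle = float(np.sqrt(np.mean((np.log1p(pred) - np.log1p(y_te)) ** 2)))
print(round(rmsle, 3))
```

Training on the log-transformed target aligns the squared-error objective with the logarithmic metric, which is why it is a common choice for RMSLE-scored regression.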
`'",,MTI Bootcamp day 3,,yeay,rmsle,mti-bootcamp-day-3 401,"'`The Neural Information Processing Scaled for Bioacoustics (NIPS4B) bird song competition asks participants to identify which of 87 sound classes of birds and their ecosystem are present in 1000 continuous wild recordings from different places in Provence, France. The data is provided by the BIOTOPE society, which maintains the largest collection of wild recordings of birds in Europe. This challenge is a more complex task than the previous ICML4B challenge, in which 77 teams participated (see proceedings at sabiod.org). For more information about the Neural Information Processing Scaled for Bioacoustics workshop, please visit the official site. Organizers Pr. H. Glotin - Institut Universitaire de France, CNRS LSIS and USTV, glotin@univ-tln.fr O. Dufour - CNRS LSIS, FR Dr. Y. Bas - BIOTOPE, FR`'",,Multi-label Bird Species Classification - NIPS 2013,research,Identify which of 87 classes of birds and amphibians are present into 1000 continuous wild sound recordings,AUC,multi-label-bird-species-classification-nips-2013 402,"'`In this playground competition, hosted in partnership with Google Cloud and Coursera, you are tasked with predicting the fare amount (inclusive of tolls) for a taxi ride in New York City given the pickup and dropoff locations. While you can get a basic estimate based on just the distance between the two points, this will result in an RMSE of $5-$8, depending on the model used (see the starter code for an example of this approach in Kernels). Your challenge is to do better than this using Machine Learning techniques! To learn how to handle large datasets with ease and solve this problem using TensorFlow, consider taking the Machine Learning with TensorFlow on Google Cloud Platform specialization on Coursera -- the taxi fare problem is one of several real-world problems that are used as case studies in the series of courses. 
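The distance-only baseline mentioned above can be sketched with the haversine formula; this is an illustrative sketch, not the official starter Kernel, and the fare coefficients below are hypothetical placeholders, not fitted values:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def naive_fare(pickup, dropoff, base=2.5, per_km=1.5):
    """Distance-only fare estimate; base and per_km are hypothetical and
    would be fit to the training data in practice."""
    return base + per_km * haversine_km(*pickup, *dropoff)
```

A regression on this single distance feature is roughly what yields the quoted $5-$8 RMSE range; richer features (time of day, airport flags, etc.) are needed to do better.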
To make this easier, head to Coursera.org/NEXTextended to claim this specialization for free for the first month!`'",,New York City Taxi Fare Prediction,,Can you predict a rider's taxi fare?,RMSE,new-york-city-taxi-fare-prediction 403,"'`The running back takes the handoff. He breaks a tackle, spins, and breaks free! One man to beat! Past the 50-yard-line! To the 40! The 30! He! Could! Go! All! The! Way! But will he? American football is a complex sport. From the 22 players on the field to specific characteristics that ebb and flow throughout the game, it can be challenging to quantify the value of specific plays and actions within a play. Fundamentally, the goal of football is for the offense to run (rush) or throw (pass) the ball to gain yards, moving towards, then across, the opposing team's side of the field in order to score. And the goal of the defense is to prevent the offensive team from scoring. In the National Football League (NFL), roughly a third of teams' offensive yardage comes from run plays. Ball carriers are generally assigned the most credit for these plays, but their teammates (by way of blocking), coach (by way of play call), and the opposing defense also play a critical role. Traditional metrics such as yards per carry or total rushing yards can be flawed; in this competition, the NFL aims to provide better context into what contributes to a successful run play. As an armchair quarterback watching the game, you may think you can predict the result of a play when a ball carrier takes the handoff - but what does the data say? In this competition, you will develop a model to predict how many yards a team will gain on given rushing plays as they happen. You'll be provided game, play, and player-level data, including the position and speed of players as provided in the NFL's Next Gen Stats data. 
And the best part - you can see how your model performs from your living room, as the leaderboard will be updated week after week on the current season's game data as it plays out. Deeper insight into rushing plays will help teams, media, and fans better understand the skill of players and the strategies of coaches. It will also assist the NFL and its teams in evaluating the ball carrier, his teammates, his coach, and the opposing defense, in order to make adjustments as necessary. Additionally, the winning model will be provided to the NFL's Next Gen Stats group to potentially share with teams. You could help the NFL Network generate models to use during games, or for pre-game/post-game breakdowns.`'",,NFL Big Data Bowl,,How many yards will an NFL player gain after receiving a handoff?,CRPS,nfl-big-data-bowl 404,"'`When a quarterback takes a snap and drops back to pass, what happens next may seem like chaos. As offensive players move in various patterns, the defense works together to prevent successful pass completions and then to quickly tackle receivers that do catch the ball. In this year's Kaggle competition, your goal is to use data science to better understand the schemes and players that make for a successful defense against passing plays. In American football, there are a plethora of defensive strategies and outcomes. The National Football League (NFL) has used previous Kaggle competitions to focus on offensive plays, but as the old proverb goes, defense wins championships. Though metrics for analyzing quarterbacks, running backs, and wide receivers are consistently a part of public discourse, techniques for analyzing the defensive part of the game lag behind. Identifying player, team, or strategic advantages on the defensive side of the ball would be a significant breakthrough for the game. This competition uses the NFL's Next Gen Stats data, which includes the position and speed of every player on the field during each play. 
You'll employ player tracking data for all drop-back pass plays from the 2018 regular season. The goal of submissions is to identify unique and impactful approaches to measure defensive performance on these plays. There are several different directions for participants to tackle (ha!), which may require differing levels of football savvy, data aptitude, and creativity. As examples: What are coverage schemes (man, zone, etc.) that the defense employs? What coverage options tend to be better performing? Which players are the best at closely tracking receivers as they try to get open? Which players are the best at closing on receivers when the ball is in the air? Which players are the best at defending pass plays when the ball arrives? Is there any way to use player tracking data to predict whether or not certain penalties (for example, defensive pass interference) will be called? Who are the NFL's best players against the pass? How does a defense react to certain types of offensive plays? Is there anything about a player (for example, their height, weight, experience, speed, or position) that can be used to predict their performance on defense? What does data tell us about defending the pass play? You are about to find out. Note: Are you a university participant? Students have the option to participate in a college-only Competition, where you'll work on the identical themes above. Students can opt in for either the Open or College Competitions, but not both.`'",,NFL Big Data Bowl 2021,,Help evaluate defensive performance on passing plays,football,nfl-big-data-bowl-2021 405,"'`The National Football League (NFL) has teamed up with Amazon Web Services (AWS) to develop the Digital Athlete, a virtual representation of a composite NFL player that the NFL can use to model game scenarios to try to better predict and prevent player injury. 
The NFL is actively addressing the need for a computer vision system to detect on-field helmet impacts as part of the Digital Athlete platform, and the league is calling on Kagglers to help. In this competition, you'll develop a computer vision model that automatically detects helmet impacts that occur on the field. Kick off with a dataset of more than one thousand definitive head impacts from thousands of game images, labeled video from the sidelines and end zones, and player tracking data. This information is sourced from the NFL's Next Gen Stats (NGS) system, which documents the position, speed, acceleration, and orientation for every player on the field during NFL games. This competition is part of the NFL's annual 1st and Future competition, which is designed to spur innovation in athlete safety and performance. For the first time this year, 1st and Future will be broadcast in primetime during Super Bowl LV week on NFL Network, and winning Kagglers may have the opportunity to present their computer vision systems as part of this exciting event. If successful, you could support the NFL's research programs in a big way: improving athletes' safety. Backed by this research, the NFL may implement rule changes and helmet design improvements to try to better protect the athletes who play the game millions watch each week. The National Football League is America's most popular sports league. Founded in 1920, the NFL developed the model for the successful modern sports league and is committed to advancing progress in the diagnosis, prevention, and treatment of sports-related injuries. Health and safety efforts include support for independent medical research and engineering advancements as well as a commitment to work to better protect players and make the game safer, including enhancements to medical protocols and improvements to how our game is taught and played. For more information about the NFL's health and safety efforts, please visit NFL.com/PlayerHealthandSafety. 
This is a Code Competition. Refer to Code Requirements for details.`'",,NFL 1st and Future - Impact Detection,,Detect helmet impacts in videos of NFL plays,PostProcessorKernelDesc,nfl-1st-and-future-impact-detection 406,"'`Welcome! In this challenge, you're tasked to investigate the relationship between the playing surface and the injury and performance of National Football League (NFL) athletes and to examine factors that may contribute to lower extremity injuries. You'll also notice there isn't a leaderboard, and you are not required to develop a predictive model. This isn't a traditional supervised Kaggle machine learning competition. For more information on this challenge format, see this forum thread. This challenge is part of NFL 1st & Future, the NFL's annual Super Bowl competition designed to spur innovation in player health, safety and performance. The Challenge In the NFL, 12 stadiums have fields with synthetic turf. Recent investigations of lower limb injuries among football athletes have indicated significantly higher injury rates on synthetic turf compared with natural turf (Mack et al., 2018; Loughran et al., 2019). In conjunction with the epidemiologic investigations, biomechanical studies of football cleat-surface interactions have shown that synthetic turf surfaces do not release cleats as readily as natural turf and may contribute to the incidence of non-contact lower limb injuries (Kent et al., 2015). Given these differences in cleat-turf interactions, it has yet to be determined whether player movement patterns and other measures of player performance differ across playing surfaces and how these may contribute to the incidence of lower limb injury. Now, the NFL is challenging Kagglers to help them examine the effects that playing on synthetic turf versus natural turf can have on player movements and the factors that may contribute to lower extremity injuries. 
NFL player tracking, also known as Next Gen Stats, is the capture of real-time location data, speed and acceleration for every player, every play on every inch of the field. As part of this challenge, the NFL has provided full player tracking of on-field position for 250 players over two regular season schedules. One hundred of the athletes in the study data set sustained one or more injuries during the study period that were identified as a non-contact injury of a type that may have turf interaction as a contributing factor to injury. The remaining 150 athletes serve as a representative sample of the larger NFL population that did not sustain a non-contact lower-limb injury during the study period. Details of the surface type and environmental parameters that may influence performance and outcome are also provided. Your challenge is to characterize any differences in player movement between the playing surfaces and identify specific scenarios (e.g., field surface, weather, position, play type, etc.) that interact with player movement to present an elevated risk of injury. More details on the entry criteria are available in the Evaluation tab. About The NFL The National Football League is America's most popular sports league, comprising 32 franchises that compete each year to win the Super Bowl, the world's biggest annual sporting event. Founded in 1920, the NFL developed the model for the successful modern sports league, including national and international distribution, extensive revenue sharing, competitive excellence, and strong franchises across the country. The NFL is committed to advancing progress in the diagnosis, prevention and treatment of sports-related injuries. The NFL's ongoing health and safety efforts include support for independent medical research and engineering advancements and a commitment to work to better protect players and make the game safer, including enhancements to medical protocols and improvements to how our game is taught and played. 
As more is learned, the league evaluates and changes rules to evolve the game and try to improve protections for players. Since 2002 alone, the NFL has made 50 rule changes intended to eliminate potentially dangerous tactics and reduce the risk of injuries. For more information about the NFL's health and safety efforts, please visit www.PlaySmartPlaySafe.com`'",,NFL 1st and Future - Analytics,analytics,Can you investigate the relationship between the playing surface and the injury and performance of NFL athletes?,tabular data,nfl-1st-and-future-analytics 407,"'`This research competition doesn't follow Kaggle's normal submission process. See the Submission Format tab for more details. Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. To accelerate research on adversarial examples, Google Brain is organizing the Competition on Adversarial Attacks and Defenses within the NIPS 2017 competition track. The competition on Adversarial Attacks and Defenses consists of three sub-competitions: Non-targeted Adversarial Attack. The goal of the non-targeted attack is to slightly modify a source image so that it will be classified incorrectly by a generally unknown machine learning classifier. Targeted Adversarial Attack. The goal of the targeted attack is to slightly modify a source image so that it will be classified as a specified target class by a generally unknown machine learning classifier. 
Defense Against Adversarial Attack. The goal of the defense is to build a machine learning classifier that is robust to adversarial examples, i.e. one that can classify adversarial images correctly. In each of the sub-competitions you're invited to make and submit a program which solves the corresponding task. At the end of the competition we will run all attacks against all defenses to evaluate how each of the attacks performs against each of the defenses. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, rules, quality, or topic will be addressed by them.`'",,NIPS 2017: Defense Against Adversarial Attack,,Create an image classifier that is robust to adversarial attacks,Score,nips-2017:-defense-against-adversarial-attack 408,"'`This research competition doesn't follow Kaggle's normal submission process. See the Submission Format tab for more details. Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. To accelerate research on adversarial examples, Google Brain is organizing the Competition on Adversarial Attacks and Defenses within the NIPS 2017 competition track. The competition on Adversarial Attacks and Defenses consists of three sub-competitions: Non-targeted Adversarial Attack. 
The goal of the non-targeted attack is to slightly modify a source image so that it will be classified incorrectly by a generally unknown machine learning classifier. Targeted Adversarial Attack. The goal of the targeted attack is to slightly modify a source image so that it will be classified as a specified target class by a generally unknown machine learning classifier. Defense Against Adversarial Attack. The goal of the defense is to build a machine learning classifier that is robust to adversarial examples, i.e. one that can classify adversarial images correctly. In each of the sub-competitions you're invited to make and submit a program which solves the corresponding task. At the end of the competition we will run all attacks against all defenses to evaluate how each of the attacks performs against each of the defenses. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, rules, quality, or topic will be addressed by them.`'",,NIPS 2017: Non-targeted Adversarial Attack,,Imperceptibly transform images in ways that fool classification models,Score,nips-2017:-non-targeted-adversarial-attack 409,"'`This research competition doesn't follow Kaggle's normal submission process. See the Submission Format tab for more details. Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. 
Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. To accelerate research on adversarial examples, Google Brain is organizing the Competition on Adversarial Attacks and Defenses within the NIPS 2017 competition track. The competition on Adversarial Attacks and Defenses consists of three sub-competitions: Non-targeted Adversarial Attack. The goal of the non-targeted attack is to slightly modify a source image so that it will be classified incorrectly by a generally unknown machine learning classifier. Targeted Adversarial Attack. The goal of the targeted attack is to slightly modify a source image so that it will be classified as a specified target class by a generally unknown machine learning classifier. Defense Against Adversarial Attack. The goal of the defense is to build a machine learning classifier that is robust to adversarial examples, i.e. one that can classify adversarial images correctly. In each of the sub-competitions you're invited to make and submit a program which solves the corresponding task. At the end of the competition we will run all attacks against all defenses to evaluate how each of the attacks performs against each of the defenses. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, rules, quality, or topic will be addressed by them.`'",,NIPS 2017: Targeted Adversarial Attack,research,Develop an adversarial attack that causes image classifiers to predict a specific target class,Score,nips-2017:-targeted-adversarial-attack 410,"'`This is the second part of NJU CS elite program's visit to HKUST. You have a larger dataset with more features. 
Your goal is to predict the running time of some computer programs. Description This competition is about modeling the performance of computer programs. The dataset provided describes a few examples of running SGDClassifier in Python. The features of the dataset describe the SGDClassifier as well as the features used to generate the synthetic training data. The data to be analyzed is the training time of the SGDClassifier.`'",,NJU 2019 visit @ HKUST,inClass,NJU 2019 visit @ HKUST SING Lab,rmse,nju-2019-visit-@-hkust 411,"'`Welcome to one of our ""Getting Started"" competitions! This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don't have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks. Competition Description Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they're observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies). But it's not always clear whether a person's words are actually announcing a disaster. Take this example: The author explicitly uses the word ABLAZE but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it's less clear to a machine. In this competition, you're challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren't. You'll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running. Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive. 
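A minimal sketch of one possible starting point (not the linked tutorial itself): TF-IDF features feeding a logistic-regression classifier, shown here on a few toy tweets rather than the real dataset.

```python
# Hedged sketch: the example tweets and labels below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "Forest fire near La Ronge Sask. Canada",   # about a real disaster (1)
    "Residents asked to shelter in place",      # about a real disaster (1)
    "What a beautiful sunset tonight",          # not a disaster (0)
    "On my way to the movies with friends",     # not a disaster (0)
]
labels = [1, 1, 0, 0]

# Pipeline keeps vectorization and classification as one fit/predict object.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tweets, labels)
print(model.predict(["Wildfire evacuation near the town"]))
```

On the real 10,000-tweet dataset the same pipeline, cross-validated, is a common first baseline before moving to pretrained language models.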
Acknowledgments This dataset was created by the company figure-eight and originally shared on their Data For Everyone website here. Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480`'",,Natural Language Processing with Disaster Tweets,,Predict which Tweets are about real disasters and which ones are not,meanfscore,natural-language-processing-with-disaster-tweets 412,"'`In this competition, you have to predict the genre of a song given a segment of lyrics. The two genres are pop (1) and rap (0). Lyrics were scraped from Genius.com. We have censored several words by replacing letters with asterisks. Please note that some lyrics were taken from explicit songs. Genre was determined based on Billboard Charts. You can use deep learning and NLP libraries for this competition.`'",,NMLO Contest 5 - Basic NLP,inClass,TJML National Machine Learning Open Contest #5,categorizationaccuracy,nmlo-contest-5-basic-nlp 413,"'`This is a 6-class image classification task. Predict your classes as {0, 1, 2, 3, 4, 5}. Labels are expected to be integer (int) type, so if you plan to use regression or any other method, round off the values to integer type only! Only 10 submissions per day are allowed. Usage of pre-trained model weights is prohibited and, if used, will be penalised. The solution is expected to be a Convolutional Neural Network, with no restriction on the framework used (Keras, Tensorflow, caffe, Pytorch, what-not). Verify the usage of any libraries and other doubts with the TAs in the Discussion Forum on Kaggle only, so that the announcements and doubts are clarified to everyone. Weights are to be stored for the best submission in a ""model.h5"" file. While submitting predictions, your target column should be named 'label'. The submission file should only have 2 columns: ['image_name', 'label'] All the best, and have fun! 
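The required submission format above can be sketched as follows; the file names and raw predictions here are hypothetical examples, not real data:

```python
# Hedged sketch: writing a submission with the two required columns,
# rounding raw model outputs to the integer labels {0..5}.
import pandas as pd

image_names = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]  # hypothetical
predictions = [0.8, 3.2, 5.0]  # e.g. raw regression-style model outputs

submission = pd.DataFrame({
    "image_name": image_names,
    "label": [int(round(p)) for p in predictions],  # labels must be ints
})
submission.to_csv("submission.csv", index=False)
print(submission["label"].tolist())  # → [1, 3, 5]
```

Note `index=False`, so the CSV contains only the two permitted columns.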
`'",,NNFL Lab2-CNN,inClass,Build a Convolutional Neural Network for Multiclass Classification,categorizationaccuracy,nnfl-lab2-cnn 414,"'`Steller sea lions in the western Aleutian Islands have declined 94 percent in the last 30 years. The endangered western population, found in the North Pacific, are the focus of conservation efforts which require annual population counts. Specially trained scientists at NOAA Fisheries Alaska Fisheries Science Center conduct these surveys using airplanes and unoccupied aircraft systems to collect aerial images. Having accurate population estimates enables us to better understand factors that may be contributing to lack of recovery of Stellers in this area. Currently, it takes biologists up to four months to count sea lions from the thousands of images NOAA Fisheries collects each year. Once individual counts are conducted, the tallies must be reconciled to confirm their reliability. The results of these counts are time-sensitive. In this competition, Kagglers are invited to develop algorithms which accurately count the number of sea lions in aerial photographs. Automating the annual population count will free up critical resources allowing NOAA Fisheries to focus on ensuring we hear the sea lions roar for many years to come. Plus, advancements in computer vision applied to aerial population counts may also greatly benefit other endangered species. Resources Learn more about research being done to better understand what's going on with the endangered Steller sea lion populations by joining scientists on a research vessel to the western Aleutian Islands in the video below.`'",,NOAA Fisheries Steller Sea Lion Population Count,,How many sea lions do you see?,MCRMSE,noaa-fisheries-steller-sea-lion-population-count 415,"'`With fewer than 500 North Atlantic right whales left in the world's oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction. 
Currently, only a handful of very experienced researchers can identify individual whales on sight while out on the water. For the majority of researchers, identifying individual whales takes time, making it difficult to effectively target whales for biological samples, acoustic recordings, and necessary health assessments. To track and monitor the population, right whales are photographed during aerial surveys and then manually matched to an online photo-identification catalog. Customized software (DIGITS) has been developed to aid in this process, but it still relies on a manual inspection of the potential comparisons, and there is a lag time for those images to be incorporated into the database. The current identification process is extremely time-consuming and requires special training. This constrains marine biologists, who work under tight deadlines with limited budgets. This competition challenges you to automate the right whale recognition process using a dataset of aerial photographs of individual whales. Automating the identification of right whales would allow researchers to better focus on their conservation efforts. Recognizing a whale in real time would also give researchers on the water access to potentially life-saving historical health and entanglement records as they struggle to free a whale that has been accidentally caught up in fishing gear. Acknowledgements MathWorks is sponsoring the competition prize pool. If your team is participating in this competition, MathWorks is also providing complimentary software. Click here for more details on how to request your copy. Thanks to Christin Khan and Leah Crowe from NOAA for hand-labeling the images to create this one-of-a-kind dataset, and to the right whale research team at the New England Aquarium for maintaining the photo-identification catalog. Without their continued efforts, none of this would be possible. 
`'",,Right Whale Recognition,research,Identify endangered right whales in aerial photographs ,MulticlassLoss,right-whale-recognition 416,"'`Innovative materials design is needed to tackle some of the most important health, environmental, energy, social, and economic challenges of this century. In particular, improving the properties of materials that are intrinsically connected to the generation and utilization of energy is crucial if we are to mitigate environmental damage due to a growing global demand. Transparent conductors are an important class of compounds that are both electrically conductive and have a low absorption in the visible range, which are typically competing properties. A combination of both of these characteristics is key for the operation of a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of compounds are currently known to display both transparency and conductivity suitable enough to be used as transparent conducting materials. Aluminum (Al), gallium (Ga), and indium (In) sesquioxides are some of the most promising transparent conductors because of a combination of both large bandgap energies, which leads to optical transparency over the visible range, and high conductivities. These materials are also chemically stable and relatively inexpensive to produce. Alloying of these binary compounds in ternary or quaternary mixtures could enable the design of a new material at a specific composition with improved properties over what is currently possible. These alloys are described by the formula (AlxGayInz)2NO3N, where x, y, and z can vary but are limited by the constraint x+y+z = 1. 
The total number of atoms in the`'",,Nomad2018 Predicting Transparent Conductors,,Predict the key properties of novel transparent semiconductors,MCRMSLE,nomad2018-predicting-transparent-conductors 417,"'`In this competition, Kaggle is challenging you to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables. Longtime Kagglers will recognize that this competition objective is similar to the ECML/PKDD trip time challenge we hosted in 2015. But this challenge comes with a twist. Instead of awarding prizes to the top finishers on the leaderboard, this playground competition was created to reward collaboration and collective learning. We are encouraging you (with cash prizes!) to publish additional training data that other participants can use for their predictions. We also have designated bi-weekly and final prizes to reward authors of kernels that are particularly insightful or valuable to the community.`'",,New York City Taxi Trip Duration,,Share code and data to improve ride time predictions,RMSLE,new-york-city-taxi-trip-duration 418,'`CIFAR 10 playground.`',,ods_class_cs231n,inClass,CIFAR10 playground,multiclassloss,ods_class_cs231n 419,"'`This is a private binary classification competition for the course ""Data Science and Machine Learning 3: Tools"", part of the MSc in Business Analytics program at CEU in the Winter semester of 2018/19. In this competition, your task is to predict which articles are shared the most on social media. The data comes from the website mashable.com, from the beginning of 2015. The dataset used in the competition can be found at the UCI repository - of course, you should not cheat by checking out the whole dataset found there. 
There is a public leaderboard, which is computed on about 30% of the test data, and a private one that is computed on the remaining observations. You can use the public leaderboard to evaluate your current standing until the deadline. The final rankings will be based on the private leaderboard. Bear in mind that by submitting many times to the public leaderboard and making decisions based on these results, you might overfit your model. Acknowledgments The dataset used in the competition can be found at the UCI repository. We thank Kelwin Fernandes, Pedro Vinagre, Paulo Cortez and Pedro Sernadela for making it publicly available. Also check their publication below. K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.`'",,Online news popularity,inClass,Predict which articles will be shared the most!,auc,online-news-popularity 420,"'`Introduction Computer vision has advanced considerably but is still challenged in matching the precision of human perception. Open Images is a collaborative release of ~9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships. This uniquely large and diverse dataset is designed to spur state-of-the-art advances in analyzing and understanding images. This year's Open Images V5 release enabled the second Open Images Challenge to include the following 3 tracks: Object detection track for detecting bounding boxes around object instances, relaunched from 2018. Visual relationship detection track for detecting pairs of objects in particular relations, also relaunched from 2018. Instance segmentation track for segmenting masks of objects in images, brand new for 2019. 
Google AI hopes that having a single dataset with unified annotations for image classification, object detection, visual relationship detection, and instance segmentation will stimulate progress towards genuine scene understanding. Object Detection Track In this track of the Challenge, you are asked to predict a tight bounding box around object instances. The training set contains 12.2M bounding boxes across 500 categories on 1.7M images. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (7 per image on average). Example annotations. Left: Mark Paul Gosselaar plays the guitar by Rhys A. Right: the house by anita kluska. Both images used under CC BY 2.0 license. Please refer to the Open Images 2019 Challenge page for additional details. The challenge contains a total of 3 tracks, which are linked above in the introduction. You are invited to explore and enter as many tracks as interest you. The results of this Challenge will be presented at a workshop at the International Conference on Computer Vision. We are excited to partner with Open Images for this second year of competitions. See the link here for last year's Object Detection competition.`'",,Open Images 2019 - Object Detection,,Detect objects in varied and complex images,OpenImagesObjectDetectionAP,open-images-2019-object-detection 421,"'`Introduction Computer vision has advanced considerably but is still challenged in matching the precision of human perception. Open Images is a collaborative release of ~9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. This uniquely large and diverse dataset is designed to spur state-of-the-art advances in analyzing and understanding images. 
This year the Open Images Instance Segmentation competition is a part of the larger Robust Vision Challenge 2020. This challenge encourages the participants to develop robust computer vision algorithms able to perform well across multiple datasets. Please refer to the RVC 2020 page and the Open Images Challenge page for more details. Participants are also welcome to submit to this playground competition beyond the context of RVC. Instance Segmentation Track In this track of the Challenge, you are asked to provide segmentation masks of objects. This track's training set contains 2.1M segmentation masks for object instances in 300 categories, with a validation set containing an additional 23k masks. The train set masks were produced by our state-of-the-art interactive segmentation process, where professional human annotators iteratively correct the output of a segmentation neural network. The validation and test set masks have been annotated manually with a strong focus on quality. Example train set annotations. Left: Wuxi science park, 1995 by Gary Stevens. Right: Cat Cafe Shinjuku calico by Ari Helminen. Both images used under CC BY 2.0 license. The training data, format, and submission modalities are identical to the 2019 Open Images Challenge.`'",,Open Images Instance Segmentation RVC 2020 edition,,Outline segmentation masks of objects in images,OpenImagesObjDetectionSegmentationAP,open-images-instance-segmentation-rvc-2020-edition 422,"'`Introduction Computer vision has advanced considerably but is still challenged in matching the precision of human perception. Open Images is a collaborative release of ~9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. This uniquely large and diverse dataset is designed to spur state-of-the-art advances in analyzing and understanding images. 
This year the Open Images Object Detection competition is a part of the larger Robust Vision Challenge 2020. This challenge encourages the participants to develop robust computer vision algorithms able to perform well across multiple datasets. Please refer to the RVC 2020 page and the Open Images Challenge page for more details. Participants are also welcome to submit to this playground competition beyond the context of RVC. Object Detection Track In this track, you are asked to predict a tight bounding box around object instances. The training set contains 12.2M bounding boxes across 500 categories on 1.7M images. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (7 per image on average). Example annotations. Left: Mark Paul Gosselaar plays the guitar by Rhys A. Right: the house by anita kluska. Both images used under CC BY 2.0 license. The training data, format, and submission modalities are identical to the 2019 Open Images Challenge.`'",,Open Images Object Detection RVC 2020 edition,,Detect objects in varied and complex images,OpenImagesObjectDetectionAP,open-images-object-detection-rvc-2020-edition 423,"'` OpenEdulog , log , Log JSON 20142017 6log6000,event(hacklog2014.rar) https://goo.gl/XHsebg `'",,OpenEdu Learning Data Analysis,inClass,Predict whether the student is a student that we are interested in.,meanfscore,openedu-learning-data-analysis 424,"'`Welcome Everyone! This is a time-series-based data science competition. This competition requires you to predict the hourly average global reactive power for a particular household, whose data has been provided. In the data tab, you will find the dataset for a particular household from Paris, France. 
The objective is to predict the hourly average active power for the month of October 2010. The Evaluation tab contains details about the judgement criteria. Note: There is no restriction on the method or methods to use. Users are required to submit their final submission before the deadline, strictly in the format specified in the sample_submission.csv file. The public leaderboard score reflects only a part of the evaluated results. The final decision will be based on an evaluation of both the private and public leaderboard scores.`'",,Oracle Idea Day Challenge,,Forecast daily consumption for the next month.,rmse,oracle-idea-day-challenge 425,"'`Imagine one day your breathing became consistently labored and shallow. Months later you were finally diagnosed with pulmonary fibrosis, a disorder with no known cause and no known cure, created by scarring of the lungs. If that happened to you, you would want to know your prognosis. That's where a troubling disease becomes frightening for the patient: outcomes can range from long-term stability to rapid deterioration, but doctors aren't easily able to tell where an individual may fall on that spectrum. Your help, and data science, may be able to aid in this prediction, which would dramatically help both patients and clinicians. Current methods make fibrotic lung diseases difficult to treat, even with access to a chest CT scan. In addition, the wide range of varied prognoses creates issues in organizing clinical trials. Finally, patients suffer extreme anxiety, in addition to fibrosis-related symptoms, from the disease's opaque path of progression. The Open Source Imaging Consortium (OSIC) is a not-for-profit, co-operative effort between academia, industry and philanthropy. The group enables rapid advances in the fight against Idiopathic Pulmonary Fibrosis (IPF), fibrosing interstitial lung diseases (ILDs), and other respiratory diseases, including emphysematous conditions. 
Its mission is to bring together radiologists, clinicians and computational scientists from around the world to improve imaging-based treatments. In this competition, you'll predict a patient's severity of decline in lung function based on a CT scan of their lungs. You'll determine lung function based on output from a spirometer, which measures the volume of air inhaled and exhaled. The challenge is to use machine learning techniques to make a prediction with the image, metadata, and baseline FVC as input. If successful, patients and their families would better understand their prognosis when they are first diagnosed with this incurable lung disease. Improved severity detection would also positively impact treatment trial design and accelerate the clinical development of novel treatments. This is a Code Competition. Refer to Code Requirements for details.`'",,OSIC Pulmonary Fibrosis Progression,,Predict lung function decline,LaplaceLogLikelihood,osic-pulmonary-fibrosis-progression 426,"'`Get started on this competition through Kaggle Scripts. The Otto Group is one of the world's biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). We sell millions of products worldwide every day, with several thousand products being added to our product line. A consistent analysis of the performance of our products is crucial. However, due to our diverse global infrastructure, many identical products get classified differently. Therefore, the quality of our product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights we can generate about our product range. For this competition, we have provided a dataset with 93 features for more than 200,000 products. The objective is to build a predictive model which is able to distinguish between our main product categories. 
The winning models will be open-sourced.`'",,Otto Group Product Classification Challenge,,Classify products into the correct category,MulticlassLoss,otto-group-product-classification-challenge 427,"'`The internet is a stimulating treasure trove of possibility. Every day we stumble on news stories relevant to our communities or experience the serendipity of finding an article covering our next travel destination. Outbrain, the web's leading content discovery platform, delivers these moments while we surf our favorite sites. Currently, Outbrain pairs relevant content with curious readers in about 250 billion personalized recommendations every month across many thousands of sites. In this competition, Kagglers are challenged to predict which pieces of content its global base of users are likely to click on. Improving Outbrain's recommendation algorithm will mean more users uncover stories that satisfy their individual tastes.`'",,Outbrain Click Prediction,,Can you predict which recommended content each user will click?,MAP@{K},outbrain-click-prediction 428,"'`About Us One Fourth Labs is an IIT Madras-incubated startup with a goal to make India ready for the AI age. We want to skill the Indian workforce in the areas of Artificial Intelligence (AI) at almost one-tenth the industry upskilling price. Our flagship online school PadhAI provides India-specific courses on AI, and is open to all students, faculty, and professionals with a basic background in mathematics and Python. Task The goal is to identify the presence of a character in images using MP Neuron / Perceptron / Perceptron with sigmoid. The character images are compiled in Tamil, Hindi and English. We have structured the task into 4 levels of increasing data complexity. 
Evaluation Metric Submissions are evaluated on the accuracy score between the predicted and the actual labels on the test dataset. Acknowledgements Tamil Character Data: http://www.jfn.ac.lk/index.php/data-sets-printed-tamil-characters-printed-documents/ Hindi Character Data: https://www.kaggle.com/ashokpant/devanagari-character-dataset`'",,PadhAI: Text - Non Text Classification Level 2,,Can you predict whether an image has TEXT or NOT?,categorizationaccuracy,padhai:-text-non-text-classification-level-2 429,"'` 17011771 https://youtu.be/_yCC6P5vZks baseline colab code : https://colab.research.google.com/drive/1t28FWa4Ha0DyrSLM-M4RsEcI_IobLuqq?usp=sharing`'",,Parking lot,inClass,Predict whether public parking lots are free or paid on Saturdays,categorizationaccuracy,parking-lot 430,"'`You have to predict if a person exercises daily or not based on his/her brief medical history and other information. The answer is in terms of ""yes"" or ""no"" only.`'",,PASC Data-Quest 2.0-2.0,inClass,Take Me Higher!,categorizationaccuracy,pasc-data-quest-2.0-2.0 431,"'`While long lines and frantically shuffling luggage into plastic bins isn't a fun experience, airport security is a critical and necessary requirement for safe travel. No one understands the need for both thorough security screenings and short wait times more than the U.S. Transportation Security Administration (TSA). They're responsible for all U.S. airport security, screening more than two million passengers daily. As part of their Apex Screening at Speed Program, DHS has identified high false alarm rates as creating significant bottlenecks at the airport checkpoints. Whenever TSA's sensors and algorithms predict a potential threat, TSA staff needs to engage in a secondary, manual screening process that slows everything down. And as the number of travelers increases every year and new threats develop, their prediction algorithms need to continually improve to meet the increased demand. 
Currently, TSA purchases updated algorithms exclusively from the manufacturers of the scanning equipment used. These algorithms are proprietary, expensive, and often released in long cycles. In this competition, TSA is stepping outside their established procurement process and is challenging the broader data science community to help improve the accuracy of their threat prediction algorithms. Using a dataset of images collected on the latest generation of scanners, participants are challenged to identify the presence of simulated threats under a variety of object types, clothing types, and body types. Even a modest decrease in false alarms will help TSA significantly improve the passenger experience while maintaining high levels of security. This is a two-stage competition. Please read our two-stage FAQs to understand more about what this means. All persons contained in the dataset are volunteers who have agreed to have their images used for this competition. The images may contain sensitive content. We kindly request that you conduct yourself with professionalism, respect, and maturity when working with this data.`'",,Passenger Screening Algorithm Challenge,featured,Improve the accuracy of the Department of Homeland Security's threat recognition algorithms,LogLoss,passenger-screening-algorithm-challenge 432,"'`Welcome! Today we will predict images using PCA. The data contains predictors that are single pixels. We want to predict whether a number is an 8 or a 9.`'",,PCA for image recognition,inClass,Reduce the dimensions of your data with PCA!,categorizationaccuracy,pca-for-image-recognition 433,"'`Millions of stray animals suffer on the streets or are euthanized in shelters every day around the world. If homes can be found for them, many precious lives can be saved and more happy families created. PetFinder.my has been Malaysia's leading animal welfare platform since 2008, with a database of more than 150,000 animals. 
PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare. Animal adoption rates are strongly correlated to the metadata associated with their online profiles, such as descriptive text and photo characteristics. As one example, PetFinder is currently experimenting with a simple AI tool called the Cuteness Meter, which ranks how cute a pet is based on qualities present in their photos. In this competition you will be developing algorithms to predict the adoptability of pets - specifically, how quickly is a pet adopted? If successful, they will be adapted into AI tools that will guide shelters and rescuers around the world on improving their pet profiles' appeal, reducing animal suffering and euthanization. Top participants may be invited to collaborate on implementing their solutions into AI tools for assessing and improving pet adoption performance, which will benefit global animal welfare. Important Note Be aware that this is being run as a Kernels Only Competition, requiring that all submissions be made via a Kernel output. Photo by Krista Mangulsone on Unsplash`'",,PetFinder.my Adoption Prediction,,How cute is that doggy in the shelter?,QuadraticWeightedKappa,petfinder.my-adoption-prediction 434,"'`The taxi industry is evolving rapidly. New competitors and technologies are changing the way traditional taxi services do business. While this evolution has created new efficiencies, it has also created new problems. One major shift is the widespread adoption of electronic dispatch systems that have replaced the VHF-radio dispatch systems of times past. These mobile data terminals are installed in each vehicle and typically provide information on GPS localization and taximeter state. Electronic dispatch systems make it easy to see where a taxi has been, but not necessarily where it is going. 
In most cases, taxi drivers operating with an electronic dispatch system do not indicate the final destination of their current ride. Another recent change is the switch from broadcast-based (one-to-many) radio messages for service dispatching to unicast-based (one-to-one) messages. With unicast messages, the dispatcher needs to correctly identify which taxi they should dispatch to a pick-up location. Since taxis using electronic dispatch systems do not usually enter their drop-off location, it is extremely difficult for dispatchers to know which taxi to contact. To improve the efficiency of electronic taxi dispatching systems, it is important to be able to predict the final destination of a taxi while it is in service. Particularly during periods of high demand, there is often a taxi whose current ride will end near or exactly at a requested pick-up location from a new rider. If a dispatcher knew approximately where their taxi drivers would be ending their current rides, they would be able to identify which taxi to assign to each pickup request. The spatial trajectory of an occupied taxi could provide some hints as to where it is going. Similarly, given the taxi id, it might be possible to predict its final destination based on the regularity of pre-hired services. In a significant number of taxi rides (approximately 25%), the taxi has been called through the taxi call-center, and the passenger's telephone id can be used to narrow the destination prediction based on historical ride data connected to their telephone id. In this challenge, we ask you to build a predictive framework that is able to infer the final destination of taxi rides in Porto, Portugal, based on their (initial) partial trajectories. The output of such a framework must be the final trip's destination (WGS84 coordinates). This is the first of two data science challenges that share the same dataset. The Taxi Service Trip Time competition predicts the total time of taxi rides. 
This competition is affiliated with the organization of ECML/PKDD 2015.`'",,ECML/PKDD 15: Taxi Trajectory Prediction (I),,Predict the destination of taxi trips based on initial partial trajectories,AHD@{Type},ecml/pkdd-15:-taxi-trajectory-prediction-(i) 435,"'`Every minute, the world loses an area of forest the size of 48 football fields. And deforestation in the Amazon Basin accounts for the largest share, contributing to reduced biodiversity, habitat loss, climate change, and other devastating effects. But better data about the location of deforestation and human encroachment on forests can help governments and local stakeholders respond more quickly and effectively. Planet, designer and builder of the world's largest constellation of Earth-imaging satellites, will soon be collecting daily imagery of the entire land surface of the earth at 3-5 meter resolution. While considerable research has been devoted to tracking changes in forests, it typically depends on coarse-resolution imagery from Landsat (30 meter pixels) or MODIS (250 meter pixels). This limits its effectiveness in areas where small-scale deforestation or forest degradation dominates. Furthermore, these existing methods generally cannot differentiate between human causes of forest loss and natural causes. Higher-resolution imagery has already been shown to be exceptionally good at this, but robust methods have not yet been developed for Planet imagery. In this competition, Planet and its Brazilian partner SCCON are challenging Kagglers to label satellite image chips with atmospheric conditions and various classes of land cover/land use. Resulting algorithms will help the global community better understand where, how, and why deforestation happens all over the world - and ultimately how to respond. To dig into and explore more Planet data, sign up for a free account. And if you're interested in building applications on Planet data, check out our Application Developer Program. 
Getting Started Review the data page, which includes detailed information about the labels and the labeling process. Download a subsample of the data to get familiar with how it looks. Explore the subsample on Kernels. We've created a notebook for you to get started.`'",,Planet: Understanding the Amazon from Space,,Use satellite data to track the human footprint in the Amazon rainforest,MeanFScoreBeta,planet:-understanding-the-amazon-from-space 436,"'`Problem Statement Misdiagnosis of the many diseases impacting agricultural crops can lead to misuse of chemicals, leading to the emergence of resistant pathogen strains, increased input costs, and more outbreaks with significant economic loss and environmental impacts. Current disease diagnosis based on human scouting is time-consuming and expensive, and although computer-vision-based models have the promise to increase efficiency, the great variance in symptoms due to age of infected tissues, genetic variations, and light conditions within trees decreases the accuracy of detection. Specific Objectives The objectives of the Plant Pathology Challenge are to train a model, using images from the training dataset, to 1) Accurately classify a given image from the testing dataset into a diseased category or as a healthy leaf; 2) Accurately distinguish between many diseases, sometimes more than one on a single leaf; 3) Deal with rare classes and novel symptoms; 4) Address depth perception (angle, light, shade, physiological age of the leaf); and 5) Incorporate expert knowledge in identification, annotation, quantification, and guiding computer vision to search for relevant features during learning. Resources Details and background information on the dataset and Kaggle competition Plant Pathology 2020 Challenge were published. If you use the dataset for your project, please cite the following peer-reviewed research article: Thapa, Ranjita; Zhang, Kai; Snavely, Noah; Belongie, Serge; Khan, Awais. 
The Plant Pathology Challenge 2020 data set to classify foliar disease of apples. Applications in Plant Sciences, 8 (9), 2020. Acknowledgments We acknowledge financial support from the Cornell Initiative for Digital Agriculture (CIDA) and special thanks to Zach Guillian for help with data collection. Kaggle is excited to partner with research groups to push forward the frontier of machine learning. Research competitions make use of Kaggle's platform and experience, but are largely organized by the research group's data science team. Any questions or concerns regarding the competition data, quality, or topic will be addressed by them.`'",,Plant Pathology 2020 - FGVC7,,Identify the category of foliar diseases in apple trees,MCAUC,plant-pathology-2020-fgvc7 437,"'`Can you differentiate a weed from a crop seedling? The ability to do so effectively can mean better crop yields and better stewardship of the environment. The Aarhus University Signal Processing group, in collaboration with the University of Southern Denmark, has recently released a dataset containing images of approximately 960 unique plants belonging to 12 species at several growth stages. We're hosting this dataset as a Kaggle competition in order to give it wider exposure, to give the community an opportunity to experiment with different image recognition techniques, as well as to provide a place to cross-pollinate ideas. Acknowledgments We extend our appreciation to the Aarhus University Department of Engineering Signal Processing Group for hosting the original data. Citation A Public Image Database for Benchmark of Plant Seedling Classification Algorithms`'",,Plant Seedlings Classification,,Determine the species of a seedling from an image,MeanFScore,plant-seedlings-classification 438,"'`Introduction Dear students, welcome to the first stage of the first assignment of PMR3508. This assignment is divided into two stages. 
The first consists of taking part in this closed competition, in a safe environment where you can become familiar with Kaggle, with Python, and with their tools. The second part consists of signing up for the [Costa Rican Household Poverty Level Prediction](https://www.kaggle.com/c/costa-rican-household-poverty-prediction) competition and following the same procedure that will be carried out here. Procedure The procedure for carrying out this assignment is quite simple. Each student is expected to produce a Jupyter notebook containing at least two sections: one on data exploration, observing how the dataset's variables are distributed, which values they take, etc.; the other dedicated to evaluating the impact of variable selection, feature engineering, and the choice of the parameter K of the K-Nearest Neighbors (KNN) algorithm on the accuracy of your classifier. Submitting your predictions for the test set is optional but encouraged since, to foster a bit of healthy competitiveness, the teaching assistant promises a Lindt chocolate bar (125 grams) to the leaderboard-winning student whose result is confirmed in the published kernel. In case of veganism, diabetes, lactose intolerance, or any other circumstance that would keep the winner from enjoying the prize, a substitute of similar value will be negotiated in due time. ATTENTION: The grade for this exercise will be based on the SUBMITTED KERNEL. If a student submits a result to the leaderboard but does not provide the kernel, NOT ONLY WILL THEY NOT BE COMPETING FOR THE CHOCOLATE, THEY WILL ALSO RECEIVE A GRADE OF ZERO FOR THIS ACTIVITY. Instructions On the ""Data"" tab you will find the files containing the data and their respective descriptions. After completing your assignment, save your notebook (in .ipynb format) and click on the ""Kernels"" tab. There, click ""new kernel"" and follow the instructions to submit your kernel to the competition. Then proceed in the same way for the second part of this exercise. Special thanks to the UCI Repository for making this dataset available in the public domain.
Tips 1) The pandas documentation is your friend 2) The pd.describe() command is very useful for getting a feel for your data 3) The scikit-learn library has several interesting tutorials on its site ( http://scikit-learn.org/stable/index.html ) 4) Remember to cross-validate your results!!!!`'",,PMR3508 - Tarefa 1 - 3508 Adult Dataset,inClass,Introdução ao Kaggle e ao algoritmo k-means,categorizationaccuracy,pmr3508-tarefa-1-3508-adult-dataset 439,"'`Data Science London and the UK Windows Azure Users Group, in partnership with Microsoft and Peerindex, announce the Influencers in Social Networks competition as part of The Big Data Hackathon. The dataset, provided by Peerindex, comprises a standard, pair-wise preference learning task. Each datapoint describes two individuals, A and B. For each person, 11 p The binary label represents a human judgement about which one of the two individuals is more influential. A label '1' means A is more influential than B. 0 means B is more influential than A. The goal of the challenge is to train a machine learning model which, for pairs of individuals, predicts the human judgement on who is more influential with high accuracy. Labels for the dataset have been collected by PeerIndex using an application similar to the one described in this post. A python script computing a sample benchmark solution is available here: Competition begins: Saturday, Apr 13, 1pm BST (12 noon UTC) This competition awards 25% of the ranking points of a standard competition, but does not count towards tiers. `'",tabular data,Influencers in Social Networks,featured,Predict which people are influential in a social network,AUC,influencers-in-social-networks 440,"'`StumbleUpon is a user-curated web content discovery engine that recommends relevant, high quality pages and media to its users, based on their interests.
While some pages we recommend, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can either be classified as ""ephemeral"" or ""evergreen"". The ratings we get from our community give us strong signals that a page may no longer be relevant - but what if we could make this distinction ahead of time? A high quality prediction of ""ephemeral"" or ""evergreen"" would greatly improve a recommendation system like ours. Many people know evergreen content when they see it, but can an algorithm make the same determination without human intuition? Your mission is to build a classifier which will evaluate a large set of URLs and label them as either evergreen or ephemeral. Can you out-class(ify) StumbleUpon? As an added incentive to the prize, a strong performance in this competition may lead to a career-launching internship at one of the best places to work in San Francisco.`'",tabular data,StumbleUpon Evergreen Classification Challenge,featured,Build a classifier to categorize webpages as evergreen or non-evergreen,AUC,stumbleupon-evergreen-classification-challenge 441,"'`With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world's most well-known brands: Burger King, Sbarro, Popeyes, Usta Dönerci, and Arby's. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites. Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures. New restaurant sites take large investments of time and capital to get up and running.
When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred. Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations. TFI would love to hire an expert Kaggler like you to head up their growing data science team in Istanbul or Shanghai. You'd be tackling problems like the one featured in this competition on a global scale. See the job description here >>`'",tabular data,Restaurant Revenue Prediction,featured,Predict annual restaurant sales based on objective measurements,RMSE,restaurant-revenue-prediction 442,"'`West Nile virus is most commonly spread to humans through infected mosquitos. Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever, to serious neurological illnesses that can result in death. In 2002, the first human cases of West Nile virus were reported in Chicago. By 2004 the City of Chicago and the Chicago Department of Public Health (CDPH) had established a comprehensive surveillance and control program that is still in effect today. Every week from late spring through the fall, mosquitos in traps across the city are tested for the virus. The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations. Given weather, location, testing, and spraying data, this competition asks you to predict when and where different species of mosquitos will test positive for West Nile virus. 
A more accurate method of predicting outbreaks of West Nile virus in mosquitos will help the City of Chicago and CDPH more efficiently and effectively allocate resources towards preventing transmission of this potentially deadly virus. We've jump-started your analysis with some visualizations and starter code in R and Python on Kaggle Scripts. No data download or local environment setup needed! Acknowledgements This competition is sponsored by the Robert Wood Johnson Foundation. Data is provided by the Chicago Department of Public Health.`'",tabular data,West Nile Virus Prediction,featured,Predict West Nile virus in mosquitos across the city of Chicago,AUC,west-nile-virus-prediction 443,"'`Like most companies, Red Hat is able to gather a great deal of information over time about the behavior of individuals who interact with them. They're in search of better methods of using this behavioral data to predict which individuals they should approach, and even when and how to approach them. In this competition, Kagglers are challenged to create a classification algorithm that accurately identifies which customers have the most potential business value for Red Hat based on their characteristics and activities. With an improved prediction model in place, Red Hat will be able to more efficiently prioritize resources to generate more business and better serve their customers.`'",tabular data,Predicting Red Hat Business Value,featured,Classify customer potential,AUC,predicting-red-hat-business-value 444,"'`Nothing is more comforting than being greeted by your favorite drink just as you walk through the door of the corner café. While a thoughtful barista knows you take a macchiato every Wednesday morning at 8:15, it's much more difficult in a digital space for your preferred brands to personalize your experience. TalkingData, China's largest third-party mobile data platform, understands that everyday choices and behaviors paint a picture of who we are and what we value.
Currently, TalkingData is seeking to leverage behavioral data from more than 70% of the 500 million mobile devices active daily in China to help its clients better understand and interact with their audiences. In this competition, Kagglers are challenged to build a model predicting users' demographic characteristics based on their app usage, geolocation, and mobile device properties. Doing so will help millions of developers and brand advertisers around the world pursue data-driven marketing efforts which are relevant to their users and catered to their preferences. Acknowledgements`'",tabular data,TalkingData Mobile User Demographics,featured,Get to know millions of mobile device users,MulticlassLoss,talkingdata-mobile-user-demographics 445,"'`All was well in Santa's workshop. The gifts were made, the route was planned, the naughty and nice list complete. Santa thought this would finally be the year he didn't need Kaggle's help with his combinatorial conundrums. At last, the Claus family could take the elves and reindeer on that well-deserved vacation to the South Pole. Then, with just days until the big night, Santa received an email from a panicked database admin elf. Attached was a server log with the six least jolly words a jolly old St. Nick could read: ALTER TABLE Gifts DROP COLUMN Weight One of the North Pole elf interns had mistakenly deleted the weights for all of the inventory in the workshop! Santa didn't have a backup (remember, this is a guy who makes a list and checks it twice) and, without knowing each present's weight, he didn't know how he would safely pack his many gift bags. Gifts were already on their way to the sleigh packing facility and there wasn't time to re-weigh all the presents. It was once again necessary to summon the holiday talents of Kaggle's elite. Can you help Santa fill his multiple bags with sets of uncertain gifts?
Save the season by turning Santa's uncertain probabilities into presents for good little boys and girls.`'",tabular data,Santa's Uncertain Bags,playground,"♫ Bells are ringing, children singing, all is merry and bright. Santa's elves made a big mistake, now he needs your help tonight ♫",SantaWeightedBins,santas-uncertain-bags 446,"'`Estimating supply and demand is one of the most important problems for online transportation companies. Many of these companies' large scale business strategies depend on being able to accurately predict supply and demand at any point in time. For instance, knowing that during certain times in the day the number of requests for rides exceeds the number of available drivers might lead the company to encourage drivers to work more during those times by providing incentives. In this challenge, you are provided with data indicating the number of requests for rides per hour in different areas of Tehran spanning a period of several weeks. Note that the data provided to you for this challenge contains the actual number of requests observed by Tap30 and you are facing a problem with real-world data.`'",tabular data,Tap30 Challenge,inClass,Online Taxi Demand Prediction,rmse,tap30-challenge 447,"'`'Tis the night before Christmas year: two thousand seventeen. Santa's grown grouchy, borderline mean. What used to be simple for Old St. Nick, is now too puzzling, it's making him sick! See, Santa always knew, deep down in his gut, what toy each kid wanted, no ifs, ands, or buts. But fierce population growth, more twins, and toy innovation, has left too complex a problem, in dire need of optimization. ""Don't worry, Mr. Santa,"" said an Elf named McMaggle, ""I have a solution! Have you heard of Kaggle?"" As she explained Kaggle in-depth, Santa's doubt began turning, he became a believer in the magic of...machine learning. So, Santa's team needs YOU more than ever this year, to solve this painful problem and save Christmas cheer.
The Challenge In this playground competition, you're challenged to build a toy matching algorithm that maximizes happiness by pairing kids with toys they want. In the dataset, each kid has 10 preferences for their gift (from 1000) and Santa has 1000 preferred kids for every gift available. What makes this extra difficult is that 0.4% of the kids are twins, and by their parents' request, require the same gift.`'",tabular data,Santa Gift Matching Challenge,featured,Down through the chimney with lots of toys...,SantaResident,santa-gift-matching-challenge 448,"'`Imagine suddenly gasping for air, helplessly breathless for no apparent reason. Could it be a collapsed lung? In the future, your entry in this competition could predict the answer. Pneumothorax can be caused by a blunt chest injury, damage from underlying lung disease, or, most horrifying, it may occur for no obvious reason at all. On some occasions, a collapsed lung can be a life-threatening event. Pneumothorax is usually diagnosed by a radiologist on a chest x-ray, and can sometimes be very difficult to confirm. An accurate AI algorithm to detect pneumothorax would be useful in a lot of clinical scenarios. AI could be used to triage chest radiographs for priority interpretation, or to provide a more confident diagnosis for non-radiologists. The Society for Imaging Informatics in Medicine (SIIM) is the leading healthcare organization for those interested in the current and future use of informatics in medical imaging. Their mission is to advance medical imaging informatics across the enterprise through education, research, and innovation in a multi-disciplinary community. Today, they need your help. In this competition, you'll develop a model to classify (and if present, segment) pneumothorax from a set of chest radiographic images. If successful, you could aid in the early recognition of pneumothoraces and save lives. If you're up for the challenge, take a deep breath, and get started now.
Note: As specified on the Data Page, the dataset must be retrieved from Cloud Healthcare. Review this tutorial (or in pdf format) for instructions on how to do so. Acknowledgments SIIM Machine Learning Committee Co-Chairs, Steven G. Langer, PhD, CIIP and George Shih, MD, MS for tirelessly leading this effort and making the challenge possible in such a short period of time. SIIM Machine Learning Committee Members for their dedication in annotating the dataset, helping to define the most useful metrics and running tests to prepare the challenge for launch. SIIM Hackathon Committee, especially Mohannad Hussain, for their crucial technical support with data conversion. American College of Radiology (ACR), @RadiologyACR: For Co-hosting the challenge and Co-sponsoring the Prizes Society of Thoracic Radiology (STR), @thoracicrad: For their unparalleled expertise in adjudicating the dataset MD.ai: For providing the annotation tool and helping with the first layer of annotations`'",image data,SIIM-ACR Pneumothorax Segmentation,featured,Identify Pneumothorax disease in chest x-rays ,SIIMDice,siim-acr-pneumothorax-segmentation 449,"'`South African presidents - who said what? In this competition, you will need to build a sentence classifier to predict the speaker of a sentence from a South African State of the Nation address. The challenge The data for this competition consists of all transcripts of SONA speeches between 1990 - 2018. Approximately 30% of the sentences have been extracted from each speech and saved to a separate testing set. The training set is a csv, with one labelled row per speech. Unlike the testing set, these have not been split to sentence level. You will need to break the speeches up into individual labelled sentences and train a model to classify the sentences in the test set. 
The prize The top-scoring submission will receive a cash prize of R1000.`'",text data,Whose line is it anyway?,inClass,"Given a line from a state of the nation address, try to predict which president said it!",categorizationaccuracy,whose-line-is-it-anyway? 450,"'`This is the home page of the competition. You don't need a subtitle here. The competition sub-title will appear above. This is where you introduce the problem. You can upload images using the ""select files"" widget on the left in the competition wizard. Upload an image, refresh the page, copy its URL, then insert within the wizard's editor. If you are copy-pasting from another application, like Word or your browser, try to make sure the html formatting is clean. You can view a page's html using the button at the top right of the editor's toolbar. This is a subtitle To format pages, stick to the following conventions: Paragraphs should go in p tags Code should go in pre tags Subtitles should go in h2 tags You can display equations using LaTeX enclosed in escaped brackets. For example, this: \[ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } \] is created by this: \[ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } \] Acknowledgements We thank Professor Plum, Ph.D. for providing this dataset.`'",tabular data,SQL Saturday Madrid ML Challenge,inClass,Demuestra lo que sabes de Machine Learning con PASS España!,meanfscore,sql-saturday-madrid-ml-challenge 451,"'`The main goal of this in-class competition is to identify the sentiments of ~2000 news items, based on a predefined training set of ~8000 news items, by implementing a model using machine learning algorithms and techniques. The main language of the news is Russian.
However, some English and Kazakh names and titles can also be found.`'",text data,Sentiment Analysis in Russian,inClass,"Determine sentiments (positive, negative or neutral) of news in russian language.",meanfscore,sentiment-analysis-in-russian 452,"'`Pycon Korea 2018 Tutorial: https://www.pycon.kr/2018/program/tutorial/13 Kaggle Forest Cover Type. Multiclass/Binary Classification. Random forests? Cover trees? Not so fast, computer nerds. We're talking about the real thing. In this competition, you are asked to predict the forest cover type (the predominant kind of tree cover) from cartographic variables. The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type. This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices. Acknowledgements This dataset was provided by Jock A. Blackard and Colorado State University. We also thank the UCI machine learning repository for hosting the dataset. If you use the problem in publication, please cite: Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science`'",tabular data,Pycon Korea 2018 - Tutorial,inClass,Pycon Korea 2018 - 미운 우리 캐글 Tutorial 공간입니다.,logloss,pycon-korea-2018-tutorial 453,"'`Your friend bailed last minute on poker night? Before giving up on a much-needed evening of bad bluffs and quarter buy-ins, light a cigar and get familiar with the rules of the game.
Each record in this competition consists of five playing cards and an attribute representing the poker hand. You are asked to predict the best hand you can play based on the cards you've been dealt. The order of cards is important, which means there are 480 possible Royal Flush hands instead of just four. Identify those, and the other 311,875,200 possible hands correctly, and you're in the money! ""Isn't this easy? I know two-of-a-kind when I see it"", you might rightfully wonder. And you'd be right. The intent of this challenge is automatic rules induction, i.e. to learn the rules using machine learning, without hand-coding heuristics. Pretend you are in a foreign land, have never played the game before, are given a history of thousands of games, and are asked to come up with the rules. It is potentially difficult to discover rules that can correctly classify poker hands, yet it is trivial for a human to validate the rules objectively. Remember, your algorithm will need to find rules that are general enough to be broadly useful, without being so broad that they end up being occasionally wrong. We suggest reading the paper by Cattral et al. for more background on the topic. Playground competitions are an opportunity to build and stretch your machine learning muscles. Pull up a chair to the data science poker table and ante up. Acknowledgements Kaggle is hosting this competition for the machine learning community to use for fun and practice. This dataset was created by Robert Cattral and Franz Oppacher. We also thank the UCI machine learning repository for hosting the dataset. If you use the problem in publication, please cite: Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science`'",,Poker Rule Induction,,Determine the poker hand of five playing cards,CategorizationAccuracy,poker-rule-induction 454,"'`This competition is now complete. Congratulations to the winners!
Millions of programmers use Stack Overflow to get high quality answers to their programming questions every day. We take quality very seriously, and have evolved an effective culture of moderation to safeguard it. With more than six thousand new questions asked on Stack Overflow every weekday, we're looking to add more sophisticated software solutions to our moderation toolbox. Closing Questions Currently about 6% of all new questions end up ""closed"". Questions can be closed as off topic, not constructive, not a real question, or too localized. More in-depth descriptions of each reason can be found in the Stack Overflow FAQ. The exact duplicate close reason has been excluded from this contest, since it depends on previous questions. Your goal is to build a classifier that predicts whether or not a question will be closed given the question as submitted, along with the reason that the question was closed. Additional data about the user at question creation time is also available.`'",,Predict Closed Questions on Stack Overflow,playground,Predict which new questions asked on Stack Overflow will be closed,MulticlassLossOld,predict-closed-questions-on-stack-overflow 455,"'`This Kaggle competition is the final assignment for students taking the Machine Learning class in the even semester of the 2017/2018 academic year in the Informatics Engineering department, Faculty of Information Technology, Universitas Tarumanagara. Can you build a model that predicts the tree type based on information about the surrounding area, such as elevation, soil type, shade, and so on? Good luck! Acknowledgement. This dataset is part of the UCI Machine Learning Repository. The original database owners are Jock A. Blackard, Dr. Denis J. Dean, and Dr. Charles W.
Anderson of the Remote Sensing and GIS Program at Colorado State.`'",,Predicting Forest Cover Type,inClass,Memprediksi jenis hutan,meanfscore,predicting-forest-cover-type 456,"'`Rules Practising unfair means, such as submitting another participant's submission file, will lead straight to disqualification. Please don't get involved in any kind of piracy. Don't submit the files from other Kaggle Users as the data is specific to this competition only.`'",,Predict the Diabetes!,,"This competition is hosted by SDS,BIT-Mesra. The participants need to predict the diabetes from the data provided.",categorizationaccuracy,predict-the-diabetes! 457,"'`Competition Description Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. Practice Skills Creative feature engineering Advanced regression techniques like random forest and gradient boosting`'",,Predict the Housing Price,,"A regression problem that requires knowledge of machine learning, regression techniques, and feature engineering.",rmse,predict-the-housing-price 458,"'`What if scientists could anticipate volcanic eruptions as they predict the weather? While determining rain or shine days in advance is more difficult, weather reports become more accurate on shorter time scales. A similar approach with volcanoes could make a big impact. Just one unforeseen eruption can result in tens of thousands of lives lost. If scientists could reliably predict when a volcano will next erupt, evacuations could be more timely and the damage mitigated.
Currently, scientists often identify time to eruption by surveying volcanic tremors from seismic signals. In some volcanoes, this intensifies as volcanoes awaken and prepare to erupt. Unfortunately, patterns of seismicity are difficult to interpret. In very active volcanoes, current approaches predict eruptions some minutes in advance, but they usually fail at longer-term predictions. Enter Italy's Istituto Nazionale di Geofisica e Vulcanologia (INGV), with its focus on geophysics and volcanology. The INGV's main objective is to contribute to the understanding of the Earth's system while mitigating the associated risks. Tasked with the 24-hour monitoring of seismicity and active volcano activity across the country, the INGV seeks to find the earliest detectable precursors that provide information about the timing of future volcanic eruptions. In this competition, using your data science skills, you'll predict when a volcano's next eruption will occur. You'll analyze a large geophysical dataset collected by sensors deployed on active volcanoes. If successful, your algorithms will identify signatures in seismic waveforms that characterize the development of an eruption. With enough notice, areas around a volcano can be safely evacuated prior to their destruction. Seismic activity is a good indicator of an impending eruption, but earlier precursors must be identified to improve longer-term predictability. The impact of your participation could be felt worldwide with tens of thousands of lives saved by more predictable volcanic eruptions and earlier evacuations.`'",,INGV - Volcanic Eruption Prediction,,Discover hidden precursors in geophysical data to help emergency response,MAE,ingv-volcanic-eruption-prediction 459,"'`Overview This research focuses on targeting through telemarketing phone calls to sell long-term deposits.
Within a campaign, the human agents execute phone calls to a list of clients to sell the deposit (outbound) or, if meanwhile the client calls the contact-center for any other reason, he is asked to subscribe to the deposit (inbound). Thus, the result is a binary unsuccessful or successful contact. Data This study considers real data collected from a Portuguese retail bank, from May 2008 to June 2013, in a total of 52,944 phone contacts. The dataset is unbalanced, as only 6557 (12.38%) records are related to successes. For evaluation purposes, a time-ordered split was initially performed, where the records were divided into training (four years) and test data (one year). The training data is used for feature and model selection and includes all contacts executed up to June 2012, in a total of 51,651 examples. The test data is used for measuring the prediction capabilities of the selected data-driven model, including the most recent 1293 contacts, from July 2012 to June 2013.`'",,Predicting Bank Telemarketing,,Goal: predict if the banking clients will subscribe to a term deposit,meanfscore,predicting-bank-telemarketing 460,"'`SejongAI..[Prediction Of Sea Ice] [17013253]_[] [Introduction] : https://youtu.be/tn8Ele9MeFI [Description] .`'",,Prediction Of Sea Ice,inClass,According to carbon emission factors,rmse,prediction-of-sea-ice 461,"'`Goal Aim of the challenge The goal of this challenge is to predict whether or not some turbofan engines are going to break down within the next 100 cycles. The dataset consists of different multivariate time-series. These different time-series refer to different engines ( engine_no in the dataset). The sampling of the time series is 1 point per engine cycle ( time_in_cycles in the dataset). The dataset is split into train data and test data to evaluate your model. In the train dataset: the engine runs until failure. It means that for each data point we can associate the RUL (Remaining Useful Life in cycles).
This column is present in the train dataset (RUL). In the test dataset: the engine runs until a certain point. What you need to predict is whether or not the engine is going to fail within the next 100 cycles.`'",,Predictive maintenance,inClass,Nasa Turbofan Dataset,meanfscore,predictive-maintenance 462,"'`With more than 1 million new diagnoses reported every year, prostate cancer (PCa) is the second most common cancer among males worldwide that results in more than 350,000 deaths annually. The key to decreasing mortality is developing more precise diagnostics. Diagnosis of PCa is based on the grading of prostate tissue biopsies. These tissue samples are examined by a pathologist and scored according to the Gleason grading system. In this challenge, you will develop models for detecting PCa on images of prostate tissue samples, and estimate severity of the disease using the most extensive multi-center dataset on Gleason grading yet available. The grading process consists of finding and classifying cancer tissue into so-called Gleason patterns (3, 4, or 5) based on the architectural growth patterns of the tumor (Fig. 1). After the biopsy is assigned a Gleason score, it is converted into an ISUP grade on a 1-5 scale. The Gleason grading system is the most important prognostic marker for PCa, and the ISUP grade has a crucial role when deciding how a patient should be treated. There is both a risk of missing cancers and a large risk of overgrading resulting in unnecessary treatment. However, the system suffers from significant inter-observer variability between pathologists, limiting its usefulness for individual patients. This variability in ratings could lead to unnecessary treatment, or worse, missing a severe diagnosis. Automated deep learning systems have shown some promise in accurately grading PCa.
Recent research, including two studies independently conducted by the groups hosting this challenge, has shown that these systems can achieve pathologist-level performance. However, these systems/results were not tested with multi-center datasets at scale. Your work here will improve on these efforts using the most extensive multi-center dataset on Gleason grading yet. The training set consists of around 11,000 whole-slide images of digitized H&E-stained biopsies originating from two centers. This is the largest public whole-slide image dataset available, roughly 8 times the size of the CAMELYON17 challenge, one of the largest digital pathology datasets and best-known challenges in the field. Furthermore, in contrast to previous challenges, we are making full diagnostic biopsy images available. Using a sizable multi-center test set, graded by expert uro-pathologists, we will evaluate challenge submissions on their applicability to improve this critical diagnostic function. Figure 1: An illustration of the Gleason grading process for an example biopsy containing prostate cancer. The most common (blue outline, Gleason pattern 3) and second most common (red outline, Gleason pattern 4) cancer growth patterns present in the biopsy dictate the Gleason score (3+4 for this biopsy), which in turn is converted into an ISUP grade (2 for this biopsy) following guidelines of the International Society of Urological Pathology. Biopsies not containing cancer are represented by an ISUP grade of 0 in this challenge. Radboud University Medical Center and Karolinska Institute have teamed up to organize this competition in collaboration with colleagues from Tampere University. The Computational Pathology Group (CPG) of the Radboud University Medical Center is a research group that develops computer algorithms to aid clinicians.
Karolinska Institute's Department of Medical Epidemiology and Biostatistics (MEB) includes an interdisciplinary research group to improve the diagnostics and treatment of prostate cancer. Together, they hope to further their existing research to make a significant impact on the healthcare of prostate cancer patients. Challenge organizer team: Wouter Bulten, Geert Litjens, Hans Pinckaers, Peter Ström, Martin Eklund, Lars Egevad, Henrik Grönberg, Kimmo Kartasalo, Pekka Ruusuvuori, Tomi Häkkinen, Sohier Dane, Maggie Demkin. Sponsors The PANDA workshop at MICCAI 2020 is sponsored by ContextVision, Ibex and Google. Using the data outside of the competition Interested in using the PANDA dataset outside of the competition? Please read this forum post for the latest information on the embargo and the challenge paper.`'",,Prostate cANcer graDe Assessment (PANDA) Challenge,,Prostate cancer diagnosis using the Gleason grading system,QuadraticWeightedKappa,prostate-cancer-grade-assessment-(panda)-challenge 463,"'`Picture this. You are a data scientist in a start-up culture with the potential to have a very large impact on the business. Oh, and you are backed up by a company with 140 years' business experience. Curious? Great! You are the kind of person we are looking for. Prudential, one of the largest issuers of life insurance in the USA, is hiring passionate data scientists to join a newly-formed Data Science group solving complex challenges and identifying opportunities. The results have been impressive so far but we want more. The Challenge In a one-click shopping world with on-demand everything, the life insurance application process is antiquated. Customers provide extensive information to identify risk classification and eligibility, including scheduling medical exams, a process that takes an average of 30 days. The result? People are turned off. That's why only 40% of U.S. households own individual life insurance.
Prudential wants to make it quicker and less labor intensive for new and existing customers to get a quote while maintaining privacy boundaries. By developing a predictive model that accurately classifies risk using a more automated approach, you can greatly impact public perception of the industry. The results will help Prudential better understand the predictive power of the data points in the existing assessment, enabling us to significantly streamline the process.`'",,Prudential Life Insurance Assessment,,Can you make buying life insurance easier?,QuadraticWeightedKappa,prudential-life-insurance-assessment 464,"'`So, where we droppin' boys and girls? Battle Royale-style video games have taken the world by storm. 100 players are dropped onto an island empty-handed and must explore, scavenge, and eliminate other players until only one is left standing, all while the play zone continues to shrink. PlayerUnknown's BattleGrounds (PUBG) has enjoyed massive popularity. With over 50 million copies sold, it's the fifth best selling game of all time, and has millions of active monthly players. The team at PUBG has made official game data available for the public to explore and scavenge outside of ""The Blue Circle."" This competition is not an official or affiliated PUBG site - Kaggle collected data made possible through the PUBG Developer API. You are given over 65,000 games' worth of anonymized player data, split into training and testing sets, and asked to predict final placement from final in-game stats and initial player ratings. What's the best strategy to win in PUBG? Should you sit in one spot and hide your way into victory, or do you need to be the top shot? Let's let the data do the talking!`'",,PUBG Finish Placement Prediction (Kernels Only),,Can you predict the battle royale finish of PUBG Players?,MAE,pubg-finish-placement-prediction-(kernels-only) 465,"'`Welcome to PWAIC's first contest of the year! 
Reminders You MUST use the K-Nearest Neighbors classification algorithm to complete this challenge. No neural networks please! Although Python is the recommended language, you may use any of your liking (Java, C/C++, etc.) You are NOT allowed to use Python libraries other than Pandas (pd) and Numpy (np) Deadline Submissions close on November 15, 2018 at 11:59pm CDT Note Regarding Leaderboard The public leaderboard is only reflective of 50% of the entire testing set. After the submission deadline ends, your program will be judged on the remaining 50% of the data. This is to prevent contestants from randomly adjusting hyperparameters to **overfit** the official testing data until their score peaks. Acknowledgements We thank Professor Plum, Ph.D. for providing this dataset.`'",,PWAIC KNN Contest!,inClass,Test your knowledge of the KNN Classification Algorithm with PWAIC's first ever in-house contest!,meanfscore,pwaic-knn-contest! 466,"'`Based on this older competition. Millions of programmers use Stack Overflow to get high quality answers to their programming questions every day. We take quality very seriously, and have evolved an effective culture of moderation to safeguard it. With more than six thousand new questions asked on Stack Overflow every weekday, we're looking to add more sophisticated software solutions to our moderation toolbox. Closing Questions Currently about 6% of all new questions end up ""closed"". Questions can be closed as off topic, not constructive, not a real question, or too localized. More in-depth descriptions of each reason can be found in the Stack Overflow FAQ. The exact duplicate close reason has been excluded from this contest, since it depends on previous questions. Your goal is to build a classifier that predicts whether or not a question will be closed given the question as submitted, along with the reason that the question was closed.
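The PWAIC contest above restricts entrants to the K-Nearest Neighbors algorithm with only NumPy and Pandas. A minimal NumPy-only KNN classifier, as a sketch of the required approach (toy data, not the contest's reference solution):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Minimal K-Nearest Neighbors classifier using only NumPy
    (a sketch of the contest's required approach, not its solution)."""
    preds = []
    for x in X_test:
        # Euclidean distance from x to every training point.
        dists = np.linalg.norm(X_train - x, axis=1)
        # Majority vote among the labels of the k nearest points.
        nearest = y_train[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)

# Toy data: two well-separated clusters (made up for illustration).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.05, 0.1], [5.1, 5.0]])))  # [0 1]
```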
Additional data about the user at question creation time is also available.`'",,pycon-2015-tutorial,inClass,Competition for PyCon 2015 Kaggle Tutorial Based on Prior Competition with Stack Exchange,logloss,pycon-2015-tutorial 467,"'`Who doesn't enjoy the morning chirp of a bird or a frog's evening croak? Animals bring more than sweet songs and natural ambience to the world. The presence of rainforest species is a good indicator of the impact of climate change and habitat loss. As it's easier to hear these species than see them, it's important to use acoustic technologies that can work on a global scale. Real-time information, such as provided through machine learning techniques, could enable early-stage detection of human impacts on the environment. This result could drive more effective conservation management decisions. Traditional methods of assessing the diversity and abundance of species are costly and limited in space and time. And while automatic acoustic identification via deep learning has been successful, models require a large number of training samples per species. This limits applicability to rarer species, which are central to conservation efforts. Thus, methods to automate high-accuracy species detection in noisy soundscapes with limited training data are the solution. Rainforest Connection (RFCx) created the world's first scalable, real-time monitoring system for protecting and studying remote ecosystems. Unlike visual-based tracking systems like drones or satellites, RFCx relies on acoustic sensors that monitor the ecosystem soundscape at selected locations year-round. RFCx technology has advanced to support a comprehensive biodiversity monitoring program that allows local partners to measure progress of wildlife restoration and recovery through principles of adaptive management. The RFCx monitoring platform also has the capacity to create convolutional neural network (CNN) models for analysis.
In this competition, you'll automate the detection of bird and frog species in tropical soundscape recordings. You'll create your models with limited, acoustically complex training data. Rich in more than bird and frog noises, expect to hear an insect or two, which your model will need to filter out. If successful, you'll have a hand in a rapidly expanding field of science: the development of automated eco-acoustic monitoring systems. The resulting real-time information could enable earlier detection of human environmental impacts, making environmental conservation more swift and effective.`'",,Rainforest Connection Species Audio Detection,,Automate the detection of bird and frog species in a tropical soundscape,WeightedLabelRankingAveragePrecision,rainforest-connection-species-audio-detection 468,"'`Rock, Paper, Scissors (sometimes called roshambo) has been a staple to settle playground disagreements or determine who gets to ride in the front seat on a road trip. The game is simple, with a balance of power. There are three options to choose from, each winning or losing to the other two. In a series of truly random games, each player would win, lose, and draw roughly one-third of games. But people are not truly random, which provides a fun opportunity for AI. Studies have shown that a Rock, Paper, Scissors AI can consistently beat human opponents. With previous games as input, it studies patterns to understand a player's tendencies. But what happens when we expand the simple Best-of-3 game to be Best-of-1000? How well can artificial intelligence perform? In this simulation competition, you will create an AI to play against others in many rounds of this classic game. Can you find patterns to make yours win more often than it loses? It's possible to greatly outperform a random player when the matches involve non-random agents. A strong AI can consistently beat predictable AI.
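The tendency-exploiting idea described for Rock, Paper, Scissors (predicting a player from their move history) can be sketched with a simple frequency counter. This toy agent is an illustration only, not the competition's agent interface:

```python
from collections import Counter

# What beats each move.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def counter_move(opponent_history):
    """Play the move that beats the opponent's most frequent move so far.
    A toy frequency-based predictor; real agents use richer patterns."""
    if not opponent_history:
        return "rock"  # arbitrary opening
    most_common = Counter(opponent_history).most_common(1)[0][0]
    return BEATS[most_common]

print(counter_move(["rock", "rock", "scissors"]))  # paper
```

Against a truly random opponent this wins only a third of the time, but against any opponent with a skewed move distribution it pulls ahead, which is the point made in the description above.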
This problem is fundamental to the fields of machine learning, artificial intelligence, and data compression. There are even potential applications in human psychology and hierarchical temporal memory. Warm up your hands and get ready to Rock, Paper, Scissors in this challenge. Image acknowledgements: Photos from The Noun Project: Rock, Paper, Scissors`'",,"Rock, Paper, Scissors",,Shoot!,rps,"rock,-paper,-scissors" 469,"'`Predict Universal Dependencies part-of-speech tags for Russian. The train.csv file contains an id column and the tokens to be tagged. Each answer joins the POS tag and its morphological features with '#', e.g. ""NOUN#Animacy=Inan|Case=Gen|Gender=Neut|Number=Plur""; individual features are separated by '|'.`'",,Russian POS-tagging,inClass,Predict Russian Universal Dependencies POS tags,categorizationaccuracy,russian-pos-tagging 470,"'`Santa was thrilled with the Kaggle community for minimizing his workshop costs! He had heard rumors that Kagglers were adept at cracking holiday challenges, but, wow, even Santa was surprised at this one. Unfortunately, the North Pole accountants were less pleased. It turns out, the accountants didn't like being one-upped by machine learning experts on the internet. To complicate matters, they've decided to allow an additional 1,000 families to attend the workshop. And they've also ""fine tuned"" their accounting formula to try and trip up those fancy solvers some people have at their disposal. Of course, we know that nothing trips up the Kaggle community! (Well, except for maybe over-fitting. But fortunately, that doesn't apply here!) So this is a bonus Santa competition for those who want an additional challenge and the opportunity to continue to improve their optimization skills. Since Santa used up all his budget on accounting fees, this is strictly a Playground competition, with the chance to win some coveted Kaggle Swag. Have fun, and Happy Holidays from the Kaggle Team! Attribution Banner/Listing Photo by Helloquence on Unsplash`'",,Santa 2019 - Revenge of the Accountants,,Oh what fun it is to revise . .
.,SantaWorkshopSchedule2019Revenge,santa-2019-revenge-of-the-accountants 471,"'`It's the most wonderful time of the year With the elves eating candy They'll feel super dandy and be of good cheer It's the most wonderful time of the year It's the hap-happiest season of all When spirits are lifted the toys will be gifted And games to enthrall! It's the hap-happiest season of all The party for throwing Has snow cones aglowing With bragging rights out on display. So now you must plan it, To beat the armed bandits who keep all the candy away. It's the most wonderful time of the year! Morale has been low at the North Pole this year. But Santa really believes in making spirits bright! So he has planned a friendly competition among the elves to keep the Christmas cheer alive and make as many toys as possible! And the winning team gets a snow cone party! As one of the team leaders, you know that nothing keeps your fellow elves more productive and motivated than a steady supply of candy canes! But all seven levels of the Candy Cane Forest are closed for revegetation, so the only ones available are stuck in the break room vending machines. And even though you receive free snacks on the job, the vending machines are always broken and don't always give you what you want. Due to social distancing, only two elves can be in the break room at once. You and another team leader will take turns trying to get candy canes out of the 100 possible vending machines in the room, but each machine is unpredictable in how likely it is to work. You do know, however, that the more often you try to use a machine, the less likely it will give you a candy cane. Plus, you only have time to try 2000 times on the vending machines until you need to get back to the workshop! If you can collect more candy canes than the other team leaders, you'll surely be able to help your team win Santa's contest! Try your hand at this multi-armed candy cane challenge!
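The candy cane setup above is a multi-armed bandit whose arms pay out less the more they are pulled. A minimal epsilon-greedy sketch of that dynamic (the decay rate, epsilon, and machine probabilities are all illustrative assumptions, and this is not the contest's agent interface):

```python
import random

def run_epsilon_greedy(true_probs, decay=0.97, steps=2000, eps=0.1, seed=0):
    """Epsilon-greedy agent on decaying 'vending machines': each pull of a
    machine multiplies its success probability by `decay`, mirroring the
    contest's rule that machines pay out less the more they are used."""
    rng = random.Random(seed)
    n = len(true_probs)
    probs = list(true_probs)
    pulls, wins = [0] * n, [0] * n
    total = 0
    for _ in range(steps):
        if rng.random() < eps or not any(pulls):
            arm = rng.randrange(n)  # explore a random machine
        else:
            # Exploit the machine with the best observed success rate.
            arm = max(range(n), key=lambda i: wins[i] / pulls[i] if pulls[i] else 0.0)
        pulls[arm] += 1
        if rng.random() < probs[arm]:
            wins[arm] += 1
            total += 1
        probs[arm] *= decay  # the machine gets stingier with use
    return total

# 100 machines, one of which starts out much better than the rest.
print(run_epsilon_greedy([0.05] * 99 + [0.9]))
```

Because the arms decay, a pure exploit strategy eventually drains its favorite machine; the contest rewards agents that balance exploration against that decay.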
Image Credit: Photos by Joanna Kosinska and Misty Ladd on Unsplash.`'",,Santa 2020 - The Candy Cane Contest,,May your workdays be merry and bright,mab,santa-2020-the-candy-cane-contest 472,"'`Hammers ring, are you listenin' In the shop, toys are glistenin' Should they see the sights? There might be a fight Walkin' round the Workshop Wonderland Families said, they want to see it Santa said, he'd guarantee it They pick a date But they may have to wait Walkin' round the Workshop Wonderland We told Santa that he was a madman He just wants to make sure they all smile He'll say Are you flexible?, They'll say Yeah man, But can you help us make it worth our while? Give them food, or sweater the more they wait, the gifts get better Please help us rank Or we'll break the bank! Walkin' round the Workshop Wonderland Santa has exciting news! For 100 days before Christmas, he opened up tours to his workshop. Because demand was so strong, and because Santa wanted to make things as fair as possible, he let each of the 5,000 families that will visit the workshop choose a list of dates they'd like to attend the workshop. Now that all the families have sent Santa their preferences, he's realized it's impossible for everyone to get their top picks, so he's decided to provide extra perks for families that don't get their preferences. In addition, Santa's accounting department has told him that, depending on how families are scheduled, there may be some unexpected and hefty costs incurred. Santa needs the help of the Kaggle community to optimize which day each family is assigned to attend the workshop in order to minimize any extra expenses that would cut into next year's toy budget! Can you help Santa out?
Attribution Banner/Listing Photo by Nathan Lemon on Unsplash Description Photo by Markus Spiske on Unsplash`'",,Santa's Workshop Tour 2019,,"In the notebook we can build a model, and pretend that it will optimize...",SantaWorkshopSchedule2019,santas-workshop-tour-2019 473,"'`From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving. Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late. In this competition, you'll work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience.`'",,Santander Customer Satisfaction,,Which customers are happy customers?,AUC,santander-customer-satisfaction 474,"'`At Santander our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals. Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan? In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. 
The data provided for this competition has the same structure as the real data we have available to solve this problem.`'",,Santander Customer Transaction Prediction,,Can you identify who will make a transaction?,AUC,santander-customer-transaction-prediction 475,"'`Ready to make a downpayment on your first house? Or looking to leverage the equity in the home you have? To support needs for a range of financial decisions, Santander Bank offers a lending hand to their customers through personalized product recommendations. Under their current system, a small number of Santander's customers receive many recommendations while many others rarely see any, resulting in an uneven customer experience. In their second competition, Santander is challenging Kagglers to predict which products their existing customers will use in the next month based on their past behavior and that of similar customers. With a more effective recommendation system in place, Santander can better meet the individual needs of all customers and ensure their satisfaction no matter where they are in life. Disclaimer: This data set does not include any real Santander Spain customers, and thus it is not representative of Spain's customer base. `'",,Santander Product Recommendation,,Can you pair products with people?,MAP@{K},santander-product-recommendation 476,"'`According to Epsilon research, 80% of customers are more likely to do business with you if you provide personalized service. Banking is no exception. The digitalization of everyday lives means that customers expect services to be delivered in a personalized and timely manner, often before they've even realized they need the service. In their 3rd Kaggle competition, Santander Group aims to go a step beyond recognizing that there is a need to provide a customer a financial service and intends to determine the amount or value of the customer's transaction. This means anticipating customer needs in a more concrete, but also simple and personal way.
With so many choices for financial services, this need is greater now than ever before. In this competition, Santander Group is asking Kagglers to help them identify the value of transactions for each potential customer. This is a first step that Santander needs to nail in order to personalize their services at scale.`'",,Santander Value Prediction Challenge,,Predict the value of transactions for potential customers.,RMSLE,santander-value-prediction-challenge 477,"'`Fork this script and get started on the problem The North Pole is in an uproar over news that Santa's magic sleigh has been stolen. Able to carry all the world's presents in one trip, it was considered crucial to successfully delivering holiday goodies across the globe in one night. Unwilling to cancel Christmas, Santa is determined to deliver toys to all the good girls and boys using his day-to-day, magic-less sleigh. With so little time to pull off this plan, Santa is once again counting on Kagglers to help. Given the sleigh's antiquated, weight-limited specifications, your challenge is to optimize the routes and loads Santa will take to and from the North Pole. And don't forget about Dasher, Dancer, Prancer, and Vixen; Santa is adamant that the best solutions will minimize the toll of this hectic night on his reindeer friends. Acknowledgements This competition is brought to you by FICO.`'",,Santa's Stolen Sleigh,,"♫ Alarm bells ring, are you listening? Santa's sleigh has gone missing ♫",SantaRideShare,santas-stolen-sleigh 478,"'`Housing costs demand a significant investment from both consumers and developers. And when it comes to planning a budget, whether personal or corporate, the last thing anyone needs is uncertainty about one of their biggest expenses. Sberbank, Russia's oldest and largest bank, helps their customers by making predictions about realty prices so renters, developers, and lenders are more confident when they sign a lease or purchase a building.
Although the housing market is relatively stable in Russia, the country's volatile economy makes forecasting prices as a function of apartment characteristics a unique challenge. Complex interactions between housing features such as number of bedrooms and location are enough to make pricing predictions complicated. Adding an unstable economy to the mix means Sberbank and their customers need more than simple regression models in their arsenal. In this competition, Sberbank is challenging Kagglers to develop algorithms which use a broad spectrum of features to predict realty prices. Competitors will rely on a rich dataset that includes housing data and macroeconomic patterns. An accurate forecasting model will allow Sberbank to provide more certainty to their customers in an uncertain economy.`'",,Sberbank Russian Housing Market,,Can you predict realty price fluctuations in Russia’s volatile economy?,RMSLE,sberbank-russian-housing-market 479,"'`We all have a heart. Although we often take it for granted, it's our heart that gives us the moments in life to imagine, create, and discover. Yet cardiovascular disease threatens to take away these moments. Each day, 1,500 people in the U.S. alone are diagnosed with heart failure, but together, we can help. We can use data science to transform how we diagnose heart disease. By putting data science to work in the cardiology field, we can empower doctors to help more people live longer lives and spend more time with those that they love. Declining cardiac function is a key indicator of heart disease. Doctors determine cardiac function by measuring end-systolic and end-diastolic volumes (i.e., the size of one chamber of the heart at the beginning and middle of each heartbeat), which are then used to derive the ejection fraction (EF). EF is the percentage of blood ejected from the left ventricle with each heartbeat. Both the volumes and the ejection fraction are predictive of heart disease.
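The ejection fraction mentioned above is derived from the two measured volumes with the standard clinical formula EF = (EDV - ESV) / EDV * 100. A small worked example (the volumes below are illustrative, not patient data):

```python
def ejection_fraction(edv_ml, esv_ml):
    """EF (%) from end-diastolic and end-systolic left-ventricular volumes:
    EF = (EDV - ESV) / EDV * 100."""
    return (edv_ml - esv_ml) / edv_ml * 100.0

# Illustrative volumes: EDV 120 mL, ESV 50 mL.
print(round(ejection_fraction(120.0, 50.0), 1))  # 58.3
```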
While a number of technologies can measure volumes or EF, Magnetic Resonance Imaging (MRI) is considered the gold standard test to accurately assess the heart's squeezing ability. The challenge with using MRI to measure cardiac volumes and derive ejection fraction, however, is that the process is manual and slow. A skilled cardiologist must analyze MRI scans to determine EF. The process can take up to 20 minutes to complete, time the cardiologist could be spending with his or her patients. Making this measurement process more efficient will enhance doctors' ability to diagnose heart conditions early, and carries broad implications for advancing the science of heart disease treatment. The 2015 Data Science Bowl challenges you to create an algorithm to automatically measure end-systolic and end-diastolic volumes in cardiac MRIs. You will examine MRI images from more than 1,000 patients. This data set was compiled by the National Institutes of Health and Children's National Medical Center and is an order of magnitude larger than any cardiac MRI data set released previously. With it comes the opportunity for the data science community to take action to transform how we diagnose heart disease. This is not an easy task, but together we can push the limits of what's possible. We can give people the opportunity to spend more time with the ones they love, for longer than ever before. Acknowledgments The Data Science Bowl is presented by: The National Heart, Lung, and Blood Institute (NHLBI) provided the MRI images for this competition. Special thanks to NHLBI Intramural Investigators Dr. Michael Hansen and Dr. Andrew Arai. Additional support for the Data Science Bowl was provided by NVIDIA:`'",,Second Annual Data Science Bowl,,Transforming How We Diagnose Heart Disease,CRPS,second-annual-data-science-bowl 480,"'`This competition is the successor to the See Click Predict Fix Hackathon.
The purpose of both competitions is to quantify and predict how people will react to a specific 311 issue. What makes an issue urgent? What do citizens really care about? How much does location matter? Being able to predict the most pressing 311 topics will allow governments to focus their efforts on fixing the most important problems. The data set for the competitions contains several hundred thousand 311 issues from four cities. For those who are more interested in using the data for visualization or ""non-predictive"" data mining, we have added a $500 visualization prize. You may submit as many entries as you wish via the Visualization page. If you're plotting issues on maps, displaying the text in some meaningful way, or making any other creative use of the data, save it and post it! About 311 311 is a mechanism by which citizens can report a problem to the city or government by submitting a description of what needs to be done, fixed, or changed. In effect, this provides a high degree of transparency between government and its constituents. Once an issue has been established, citizens can vote and make comments on the issue so that government officials have some degree of awareness about what is the most important issue to address. Sponsors The meeting space has been provided by Microsoft. Prize money is graciously offered by our sponsors: On the citizen side, SeeClickFix leverages crowdsourcing to help you both maintain the flow of incoming requests and show the public how effective you can be. When anyone in the community can report or comment on any issue, the entire group has a better perspective on what's happening--and how to fix it effectively. For governments, SeeClickFix acts as a completely-customizable CRM that plugs into your existing request management tools.
From types of service requests to managing different watch areas, SeeClickFix helps you better manage them. A public policy entrepreneur and open innovation expert, David advises numerous governments on open government and open data and works with leading non-profits and businesses on strategy, open innovation and community management. In addition to his work, David is an affiliate with the Berkman Centre for Internet and Society at Harvard where he is looking at issues surrounding the politics of data. You can find David's writing on open innovation, public policy, public sector renewal and open source systems at his blog, or at TechPresident. In addition to his writing, David is frequently invited to speak on open government, policy making, negotiation and strategy to executives, policymakers, and students. You can read a background on how this challenge came to be here.`'",,See Click Predict Fix,,Predict which 311 issues are most important to citizens,RMSLE,see-click-predict-fix 481,"'`Seizure forecasting systems hold promise for improving the quality of life for patients with epilepsy. Epilepsy afflicts nearly 1% of the world's population, and is characterized by the occurrence of spontaneous seizures. For many patients, anticonvulsant medications can be given at sufficiently high doses to prevent seizures, but patients frequently suffer side effects. For 20-40% of patients with epilepsy, medications are not effective -- and even after surgical removal of epilepsy-causing brain tissue, many patients continue to experience spontaneous seizures. Despite the fact that seizures occur infrequently, patients with epilepsy experience persistent anxiety due to the possibility of a seizure occurring. Seizure forecasting systems have the potential to help patients with epilepsy lead more normal lives. In order for EEG-based seizure forecasting systems to work effectively, computational algorithms must reliably identify periods of increased probability of seizure occurrence.
If these seizure-permissive brain states can be identified, devices designed to warn patients of impending seizures would be possible. Patients could avoid potentially dangerous activities like driving or swimming, and medications could be administered only when needed to prevent impending seizures, reducing overall side effects. There is emerging evidence that the temporal dynamics of brain activity can be classified into 4 states: Interictal (between seizures, or baseline), Preictal (prior to seizure), Ictal (seizure), and Post-ictal (after seizures). Seizure forecasting requires the ability to reliably identify a preictal state that can be differentiated from the interictal, ictal, and postictal state. The primary challenge in seizure forecasting is differentiating between the preictal and interictal states. The goal of the competition is to demonstrate the existence and accurate classification of the preictal brain state in dogs and humans with naturally occurring epilepsy. The Competition Intracranial EEG was recorded from dogs with naturally occurring epilepsy using an ambulatory monitoring system. EEG was sampled from 16 electrodes at 400 Hz, and recorded voltages were referenced to the group average. These are long duration recordings, spanning multiple months up to a year and recording up to a hundred seizures in some dogs. In addition, datasets from patients with epilepsy undergoing intracranial EEG monitoring to identify a region of the brain that can be resected to prevent future seizures are included in the contest. These datasets have varying numbers of electrodes and are sampled at 5000 Hz, with recorded voltages referenced to an electrode outside the brain. The challenge is to distinguish between ten minute long data clips covering an hour prior to a seizure, and ten minute iEEG clips of interictal activity. Seizures are known to cluster, or occur in groups.
Patients who typically have seizure clusters receive little benefit from forecasting follow-on seizures. For this contest, only lead seizures, defined here as seizures occurring four hours or more after another seizure, are included in the training and testing data sets. In order to avoid any potential contamination between interictal, preictal, and post-ictal EEG signals, interictal segments in the canine training and test data were restricted to be at least one week before or after any seizure. In the human data, where the entire monitoring session may last less than one week, interictal data segments were restricted to be at least four hours before or after any seizure. Interictal data segments were chosen at random within these restrictions for both canine and human subjects. Participants are invited to visit the NIH-sponsored International Epilepsy Electrophysiology portal (http://ieeg.org) to review and download annotated interictal and preictal data from other patients and animal subjects. Using ieeg.org data for additional algorithm training is permitted. Acknowledgements This competition is sponsored by the National Institutes of Health (NINDS), the Epilepsy Foundation, and the American Epilepsy Society. References Howbert JJ, Patterson EE, Stead SM, Brinkmann B, Vasoli V, Crepeau D, Vite CH, Sturges B, Ruedebusch V, Mavoori J, Leyde K, Sheffield WD, Litt B, Worrell GA (2014) Forecasting seizures in dogs with naturally occurring epilepsy. PLoS One 9(1):e81920. Cook MJ, O'Brien TJ, Berkovic SF, Murphy M, Morokoff A, Fabinyi G, D'Souza W, Yerra R, Archer J, Litewka L, Hosking S, Lightfoot P, Ruedebusch V, Sheffield WD, Snyder D, Leyde K, Himes D (2013) Prediction of seizure likelihood with a long-term, implanted seizure advisory system in patients with drug-resistant epilepsy: a first-in-man study. Lancet Neurol 12:563-571. Park Y, Luo L, Parhi KK, Netoff T (2011) Seizure prediction with spectral power of EEG using cost-sensitive support vector machines. 
Epilepsia 52:1761-1770. Davis KA, Sturges BK, Vite CH, Ruedebusch V, Worrell G, Gardner AB, Leyde K, Sheffield WD, Litt B (2011) A novel implanted device to wirelessly record and analyze continuous intracranial canine EEG. Epilepsy Res 96:116-122. Andrzejak RG, Chicharro D, Elger CE, Mormann F (2009) Seizure prediction: Any better than chance? Clin Neurophysiol. Snyder DE, Echauz J, Grimes DB, Litt B (2008) The statistics of a practical seizure warning system. J Neural Eng 5:392-401. Mormann F, Andrzejak RG, Elger CE, Lehnertz K (2007) Seizure prediction: the long and winding road. Brain 130:314-333. Haut S, Shinnar S, Moshe SL, O'Dell C, Legatt AD. (1999) The association between seizure clustering and status epilepticus in patients with intractable complex partial seizures. Epilepsia 40:1832-1834.`'",,American Epilepsy Society Seizure Prediction Challenge,,Predict seizures in intracranial EEG recordings,AUC,american-epilepsy-society-seizure-prediction-challenge 482,"'`Steel is one of the most important building materials of modern times. Steel buildings are resistant to natural and man-made wear, which has made the material ubiquitous around the world. To help make production of steel more efficient, this competition will help identify defects. Severstal is leading the charge in efficient steel mining and production. They believe the future of metallurgy requires development across the economic, ecological, and social aspects of the industry, and they take corporate responsibility seriously. The company recently created the country's largest industrial data lake, with petabytes of data that were previously discarded. Severstal is now looking to machine learning to improve automation, increase efficiency, and maintain high quality in their production. The production process of flat sheet steel is especially delicate. From heating and rolling, to drying and cutting, several machines touch flat steel by the time it's ready to ship. 
Today, Severstal uses images from high frequency cameras to power a defect detection algorithm. In this competition, you'll help engineers improve the algorithm by localizing and classifying surface defects on a steel sheet. If successful, you'll help keep manufacturing standards for steel high and enable Severstal to continue their innovation, leading to a stronger, more efficient world all around us.`'",,Severstal: Steel Defect Detection,,Can you detect and classify defects in steel?,Dice,severstal:-steel-defect-detection 483,"'`From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay. From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred. We're also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The top most up-voted scripts from this competition will receive official Kaggle swag as prizes. Acknowledgements Kaggle is hosting this competition for the machine learning community to use for fun and practice. This dataset is brought to you by SF OpenData, the central clearinghouse for data published by the City and County of San Francisco.`'",,San Francisco Crime Classification,,Predict the category of crimes that occurred in the city by the bay ,MulticlassLoss,san-francisco-crime-classification 484,"'`Every year, approximately 7.6 million companion animals end up in US shelters. 
Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. 2.7 million dogs and cats are euthanized in the US every year. Using a dataset of intake information including breed, color, sex, and age from the Austin Animal Center, we're asking Kagglers to predict the outcome for each animal. We also believe this dataset can help us understand trends in animal outcomes. These insights could help shelters focus their energy on specific animals who need a little extra help finding a new home. We encourage you to publish your insights on Scripts so they are publicly accessible. Acknowledgements Kaggle is hosting this competition for the machine learning community to use for data science practice and social good. The dataset is brought to you by Austin Animal Center. Shelter animal statistics were taken from the ASPCA. Glamour shots of Kaggle's shelter pets are pictured above. From left to right: Shelby, Bailey, Hazel, Daisy, and Yeti.`'",,Shelter Animal Outcomes,,Help improve outcomes for shelter animals,MulticlassLoss,shelter-animal-outcomes 485,"'`Skin cancer is the most prevalent type of cancer. Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020. It's also expected that almost 7,000 people will die from the disease. As with other cancers, early and accurate detection, potentially aided by data science, can make treatment more effective. Currently, dermatologists evaluate every one of a patient's moles to identify outlier lesions or ugly ducklings that are most likely to be melanoma. Existing AI approaches have not adequately considered this clinical frame of reference. 
Dermatologists could enhance their diagnostic accuracy if detection algorithms take into account contextual images within the same patient to determine which images represent a melanoma. If successful, classifiers would be more accurate and could better support dermatological clinic work. As the leading healthcare organization for informatics in medical imaging, the Society for Imaging Informatics in Medicine (SIIM)'s mission is to advance medical imaging informatics through education, research, and innovation in a multi-disciplinary community. SIIM is joined by the International Skin Imaging Collaboration (ISIC), an international effort to improve melanoma diagnosis. The ISIC Archive contains the largest publicly available collection of quality-controlled dermoscopic images of skin lesions. In this competition, you'll identify melanoma in images of skin lesions. In particular, you'll use images within the same patient and determine which are likely to represent a melanoma. Using patient-level contextual information may help the development of image analysis tools, which could better support clinical dermatologists. Melanoma is a deadly disease, but if caught early, most melanomas can be cured with minor surgery. Image analysis tools that automate the diagnosis of melanoma will improve dermatologists' diagnostic accuracy. Better detection of melanoma has the opportunity to positively impact millions of people.`'",,SIIM-ISIC Melanoma Classification,,Identify melanoma in lesion images,MCAUC,siim-isic-melanoma-classification 486,"'`Finding footage of a crime caught on tape is an investigator's dream. But even with crystal clear, damning evidence, one critical question always remains: is the footage real? Today, one way to help authenticate footage is to identify the camera that the image was taken with. Forgeries often require splicing together content from two different cameras. 
But, unfortunately, the most common way to do this now is using image metadata, which can be easily falsified itself. This problem is actively studied by several researchers around the world. Many machine learning solutions have been proposed in the past: least-squares estimates of a camera's color demosaicing filters as classification features, co-occurrences of pixel value prediction errors as features that are passed to sophisticated ensemble classifiers, and using CNNs to learn camera model identification features. However, this is a problem yet to be sufficiently solved. For this competition, the IEEE Signal Processing Society is challenging you to build an algorithm that identifies which camera model captured an image by using traces intrinsically left in the image. Helping to solve this problem would have a big impact on the verification of evidence used in criminal and civil trials and even news reporting.`'",,IEEE's Signal Processing Society - Camera Model Identification,,Identify from which camera an image was taken,WeightedCategorizationAccuracy,ieees-signal-processing-society-camera-model-identification 487,"'`As I scurried across the candlelit chamber, manuscripts in hand, I thought I'd made it. Nothing would be able to hurt me anymore. Little did I know there was one last fright lurking around the corner. DING! My phone pinged me with a disturbing notification. It was Will, the scariest of Kaggle moderators, sharing news of another data leak. ""phnglui mglwnafh Cthulhu Rlyeh wgahnagl fhtagn!"" I cried as I clumsily dropped my crate of unbound, spooky books. Pages scattered across the chamber floor. How will I ever figure out how to put them back together according to the authors who wrote them? Or are they lost, forevermore? Wait, I thought... I know, machine learning! In this year's Halloween playground competition, you're challenged to predict the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and HP Lovecraft. 
We're encouraging you (with cash prizes!) to share your insights in the competition's discussion forum and code in Kernels. We've designated prizes to reward authors of kernels and discussion threads that are particularly valuable to the community. Click the ""Prizes"" tab on this overview page to learn more. Getting Started New to Kernels or working with natural language data? We've put together some starter kernels in Python and R to help you hit the ground running.`'",,Spooky Author Identification,,Share code and discuss insights to identify horror authors from their writings,MulticlassLoss,spooky-author-identification 488,"'`Springleaf puts the humanity back into lending by offering their customers personal and auto loans that help them take control of their lives and their finances. Direct mail is one important way Springleaf's team can connect with customers who may be in need of a loan. Direct offers provide huge value to customers who need them, and are a fundamental part of Springleaf's marketing strategy. In order to improve their targeted efforts, Springleaf must be sure they are focusing on the customers who are likely to respond and be good candidates for their services. Using a large set of anonymized features, Springleaf is asking you to predict which customers will respond to a direct mail offer. You are challenged to construct new meta-variables and employ feature-selection methods to approach this dauntingly wide dataset.`'",,Springleaf Marketing Response,,Determine whether to send a direct mail piece to a customer ,AUC,springleaf-marketing-response 489,"'`The dataset contains five variables. The variable description is as follows. X1 = height in feet X2 = weight in pounds X3 = percent of successful field goals (out of 100 attempted) X4 = percent of successful free throws (out of 100 attempted) X5 = average points scored per game a. Do a thorough descriptive analysis and identify the patterns and potential significant variables. 
Use appropriate plots and tables. b. Fit a suitable predictive model to predict the average points scored per game using the traindata dataset. [You may use transformations and other techniques to improve the model] c. Get the predictions from the model for the testdata dataset. Submit your predictions. (Please refer to the sampleSubmissionFile.csv)`'",,ST4035_2020 Inclass #1,inClass,ST4035_2020 Inclass #1,rmse,st4035_2020-inclass-#1 490,"'`Winning the fight against the COVID-19 pandemic will require an effective vaccine that can be equitably and widely distributed. Building upon decades of research has allowed scientists to accelerate the search for a vaccine against COVID-19, but every day that goes by without a vaccine has enormous costs for the world nonetheless. We need new, fresh ideas from all corners of the world. Could online gaming and crowdsourcing help solve a worldwide pandemic? Pairing scientific and crowdsourced intelligence could help computational biochemists make measurable progress. mRNA vaccines have taken the lead as the fastest vaccine candidates for COVID-19, but currently, they face key potential limitations. One of the biggest challenges right now is how to design super stable messenger RNA molecules (mRNA). Conventional vaccines (like your seasonal flu shots) are packaged in disposable syringes and shipped under refrigeration around the world, but that is not currently possible for mRNA vaccines. Researchers have observed that RNA molecules have the tendency to spontaneously degrade. This is a serious limitation--a single cut can render the mRNA vaccine useless. Currently, little is known about which parts of the backbone of a given RNA are most prone to being affected. Without this knowledge, current mRNA vaccines against COVID-19 must be prepared and shipped under intense refrigeration, and are unlikely to reach more than a tiny fraction of human beings on the planet unless they can be stabilized. 
The Eterna community, led by Professor Rhiju Das, a computational biochemist at Stanford's School of Medicine, brings together scientists and gamers to solve puzzles and invent medicine. Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles. The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules. The Eterna community has previously unlocked new scientific principles, made new diagnostics against deadly diseases, and engaged the world's most potent intellectual resources for the betterment of the public. The Eterna community has advanced biotechnology through its contributions to over 20 publications, including advances in RNA biotechnology. In this competition, we are looking to leverage the data science expertise of the Kaggle community to develop models and design rules for RNA degradation. Your model will predict likely degradation rates at each base of an RNA molecule, trained on a subset of an Eterna dataset comprising over 3000 RNA molecules (which span a panoply of sequences and structures) and their degradation rates at each position. We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines. These final test sequences are currently being synthesized and experimentally characterized at Stanford University in parallel to your modeling efforts -- Nature will score your models! Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve. Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19. 
The problem we are trying to solve has eluded academic labs, industry R&D groups, and supercomputers, and so we are turning to you. To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic. `'",,OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction,,Urgent need to bring the COVID-19 vaccine to mass production,MWCRMSE,openvaccine:-covid-19-mrna-vaccine-degradation-prediction 491,"'`We've all been there: a light turns green and the car in front of you doesn't budge. Or, a previously unremarkable vehicle suddenly slows and starts swerving from side-to-side. When you pass the offending driver, what do you expect to see? You certainly aren't surprised when you spot a driver who is texting, seemingly enraptured by social media, or in a lively hand-held conversation on their phone. According to the CDC motor vehicle safety division, one in five car accidents is caused by a distracted driver. Sadly, this translates to 425,000 people injured and 3,000 people killed by distracted driving every year. State Farm hopes to improve these alarming statistics, and better insure their customers, by testing whether dashboard cameras can automatically detect drivers engaging in distracted behaviors. Given a dataset of 2D dashboard camera images, State Farm is challenging Kagglers to classify each driver's behavior. Are they driving attentively, wearing their seatbelt, or taking a selfie with their friends in the backseat?`'",,State Farm Distracted Driver Detection,,Can computer vision spot distracted drivers?,MulticlassLoss,state-farm-distracted-driver-detection 492,"'`Drifting icebergs present threats to navigation and activities in areas such as offshore of the East Coast of Canada. Currently, many institutions and companies use aerial reconnaissance and shore-based support to monitor environmental conditions and assess risks from icebergs. 
However, in remote areas with particularly harsh weather, these methods are not feasible, and the only viable monitoring option is via satellite. Statoil, an international energy company operating worldwide, has worked closely with companies like C-CORE. C-CORE have been using satellite data for over 30 years and have built a computer vision based surveillance system. To keep operations safe and efficient, Statoil is interested in getting a fresh new perspective on how to use machine learning to more accurately detect and discriminate against threatening icebergs as early as possible. In this competition, you're challenged to build an algorithm that automatically identifies if a remotely sensed target is a ship or iceberg. Improvements made will help drive the costs down for maintaining safe working conditions.`'",,Statoil/C-CORE Iceberg Classifier Challenge,,"Ship or iceberg, can you decide from space?",LogLoss,statoil/c-core-iceberg-classifier-challenge 493,"'`Driving while distracted, fatigued or drowsy may lead to accidents. Activities that divert the driver's attention from the road ahead, such as engaging in a conversation with other passengers in the car, making or receiving phone calls, sending or receiving text messages, eating while driving or events outside the car may cause driver distraction. Fatigue and drowsiness can result from driving long hours or from lack of sleep. The objective of this challenge is to design a detector/classifier that will detect whether the driver is alert or not alert, employing any combination of vehicular, environmental and driver physiological data that are acquired while driving. The winner receives free registration to the 2011 International Joint Conference on Neural Networks (San Jose, California July 31 - August 5, 2011), which is valued at $950. The winner will also be invited to present their solution at the conference.`'",,Stay Alert! The Ford Challenge,featured,"Driving while not alert can be deadly. 
The objective is to design a classifier that will detect whether the driver is alert or not alert, employing data that are acquired while driving.",AUC,stay-alert!-the-ford-challenge 494,"'`This competition is designed to help you get started with Julia. If you are looking for a good programming language for data science, or if you are already accustomed to one language, we encourage you to also try Julia. Julia is a relatively new language for technical computing that attempts to combine the strengths of other popular programming languages. Here we introduce two tutorials to highlight some of Julia's features. The first is focused on the basics of the language. In the second, a complete implementation of the K Nearest Neighbor algorithm is presented, highlighting features such as parallelization and speed. Both tutorials show that it is easy to write code in Julia, due to its intuitive syntax and design. The tutorials also describe some basics of image processing and some concepts of machine learning such as cross validation. After reviewing them, we hope you will be motivated to write your own machine learning algorithms in Julia. This tutorial focuses on the task of identifying characters from Google Street View images. It differs from traditional character recognition because the data set contains different character fonts and the background is not the same for all images. Acknowledgements The data was taken from the Chars74K dataset, which consists of images of characters selected from Google Street View images. We ask that you cite the following reference in any publication resulting from your work: T. E. de Campos, B. R. Babu and M. Varma, Character recognition in natural images, Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, February 2009. 
This tutorial was developed by Luis Tandalla during his summer 2014 internship at Kaggle.`'",,First Steps With Julia,,Use Julia to identify characters from Google Street View images,CategorizationAccuracy,first-steps-with-julia 495,"'`Synthessence 2018 Put your machine learning and analytical skills to the test with Synthessence! Problem Statement Given a customer's review of a restaurant and features of the restaurant, predict the customer's rating of the restaurant. This is a classification task and the evaluation metric to be used for judgement is Mean F1 score. To be eligible for prizes, fill this google form. Prizes up for grabs First place: Rs. 6000 Second place: Rs. 3500 Third place: Rs. 2500 Organized by This competition has been organized by Engineer, technical fest of NITK, Surathkal`'",,Synthessence 2018,inClass,Data Science competition to predict customer's rating of restaurant,meanfscore,synthessence-2018 496,"'`Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, we've launched many Playground competitions that are more approachable than our Featured competitions and thus, more beginner-friendly. In order to have a more consistent offering of these competitions for our community, we're trying a new experiment in 2021. We'll be launching month-long tabular Playground competitions on the 1st of every month and continue the experiment as long as there's sufficient interest and participation. The goal of these competitions is to provide a fun, approachable tabular dataset. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition. If you're an established competitions master or grandmaster, these probably won't be much of a challenge for you. We encourage you to avoid saturating the leaderboard. 
For each monthly competition, we'll be offering Kaggle Merchandise for the top three teams. And finally, because we want these competitions to be more about learning, we're limiting team sizes to 3 individuals. The dataset used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features. Good luck and have fun! Getting Started Check out this Starter Notebook which walks you through how to make your very first submission! For more ideas on how to improve your score, check out the Intro to Machine Learning and Intermediate Machine Learning courses on Kaggle Learn.`'",,Tabular Playground Series - Feb 2021,,Practice your ML skills on this approachable dataset!,RMSE,tabular-playground-series-feb-2021 497,"'`Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, we've launched many Playground competitions that are more approachable than our Featured competitions, and thus more beginner-friendly. In order to have a more consistent offering of these competitions for our community, we're trying a new experiment in 2021. We'll be launching a month-long tabular Playground competition on the 1st of every month, and continue the experiment as long as there's sufficient interest and participation. The goal of these competitions is to provide a fun, but less challenging, tabular dataset. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition. If you're an established competitions master or grandmaster, these probably won't be much of a challenge for you. We encourage you to avoid saturating the leaderboard. 
For each monthly competition, we'll be offering Kaggle Merchandise for the top three teams. And finally, because we want these competitions to be more about learning, we're limiting team sizes to 3 individuals. Good luck and have fun! Getting Started Check out this Starter Notebook which walks you through how to make your very first submission! For more ideas on how to improve your score, check out the Intro to Machine Learning and Intermediate Machine Learning courses on Kaggle Learn.`'",,Tabular Playground Series - Jan 2021,,Practice your ML regression skills on this approachable dataset!,RMSE,tabular-playground-series-jan-2021 498,"'`Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic. TalkingData, China's largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to prevent click fraud for app developers is to measure the journey of a user's click across their portfolio, and flag IP addresses who produce lots of clicks, but never end up installing apps. With this information, they've built an IP blacklist and device blacklist. While successful, they want to always be one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. In their 2nd competition with Kaggle, you're challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. 
To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!`'",,TalkingData AdTracking Fraud Detection Challenge,,Can you detect fraudulent click traffic for mobile app ads?,AUC,talkingdata-adtracking-fraud-detection-challenge 499,"'`Rudolph the red-nosed reindeer Had some very tired hooves But he had a job to finish Could he do it with the shortest moves? All of the other reindeer Used to laugh and mock his code They always said poor Rudolph Couldn't handle the workload Then one foggy Christmas Eve Santa came to say I see you've taken number theory Please make this night a bit less dreary? Then how the reindeer loved him and each enrolled in an AI degree Rudolph the red-nosed reindeer We get to go to bed early! Rudolph has always believed in working smarter, not harder. And what better way to earn the respect of Comet and Blitzen than showing the initiative to improve Santa's annual route for delivering toys on Christmas Eve? This year, Rudolph believes he can motivate the overworked Reindeer team by wisely choosing the order in which they visit the houses on Santa's list. The houses in prime cities always leave carrots for the Reindeers alongside the usual cookies and milk. These carrots are just the sustenance the Reindeers need to keep pace. In fact, Rudolph has found that if the Reindeer team doesn't originate from a prime city exactly every 10th step, it takes them 10% longer than it normally would to make their next destination! Can you help Rudolph solve the Traveling Santa problem subject to his carrot constraint? His team--and Santa--are counting on you! 
Attributions: Reindeer Photo: Norman Tsui Stocking Photo: Wesley Tingey`'",,Traveling Santa 2018 - Prime Paths,,"But does your code recall, the most efficient route of all?",TravelingSanta2,traveling-santa-2018-prime-paths 500,"'`After centuries of intense whaling, recovering whale populations still have a hard time adapting to warming oceans and struggle to compete every day with the industrial fishing industry for food. To aid whale conservation efforts, scientists use photo surveillance systems to monitor ocean activity. They use the shape of whales' tails and unique markings found in footage to identify what species of whale they're analyzing and meticulously log whale pod dynamics and movements. For the past 40 years, most of this work has been done manually by individual scientists, leaving a huge trove of data untapped and underutilized. In this competition, you're challenged to build an algorithm to identify whale species in images. You'll analyze Happy Whale's database of over 25,000 images, gathered from research institutions and public contributors. By contributing, you'll help to open rich fields of understanding for marine mammal population dynamics around the globe. We'd like to thank Happy Whale for providing this data and problem. 
Happy Whale is a platform that uses image processing algorithms to let anyone submit their whale photo and have it automatically identified.`'",,Humpback Whale Identification Challenge,,Can you identify a whale by the picture of its fluke?,MAP@{K},humpback-whale-identification-challenge 501,,,Porto Seguro’s Safe Driver Prediction,,Predict if a driver will file an insurance claim next year.,NormalizedGini,porto-seguro’s-safe-driver-prediction 502,,,LANL Earthquake Prediction,,Can you predict upcoming laboratory earthquakes?,MAE,lanl-earthquake-prediction 503,,,Optiver Realized Volatility Prediction,,Apply your data science skills to make financial markets better,RootMeanSquarePercentageError,optiver-realized-volatility-prediction 504,,,Zillow Prize: Zillow’s Home Value Prediction (Zestimate),,Can you improve the algorithm that changed the world of real estate?,ZillowMAE,zillow-prize:-zillow’s-home-value-prediction-(zestimate) 505,,,2018 Data Science Bowl ,,Find the nuclei in divergent images to advance medical discovery,IntersectionOverUnionObjectSegmentation,2018-data-science-bowl- 506,,,CommonLit Readability Prize,,Rate the complexity of literary passages for grades 3-12 classroom use,RMSE,commonlit-readability-prize 507,,,PetFinder.my - Pawpularity Contest,,Predict the popularity of shelter pet photos,RMSE,petfinder.my-pawpularity-contest 508,,,2019 Data Science Bowl,,Uncover the factors to help measure how young children learn,QuadraticWeightedKappa,2019-data-science-bowl 509,"'`Get started on this competition through Kaggle Scripts Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world. 
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C. Acknowledgements Kaggle is hosting this competition for the machine learning community to use for fun and practice. This dataset was provided by Hadi Fanaee Tork using data from Capital Bikeshare. We also thank the UCI machine learning repository for hosting the dataset. If you use the problem in publication, please cite: Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.`'",tabular data_time series,Bike Sharing Demand,playground,Forecast use of a city bikeshare system,RMSLE,bike-sharing-demand 510,,,TGS Salt Identification Challenge,,Segment salt deposits beneath the Earth's surface,IntersectionOverUnionObjectSegmentation,tgs-salt-identification-challenge 511,,,15.071x - The Analytics Edge (Spring 2015),,Test your analytics skills by predicting which New York Times blog articles will be the most popular,,15.071x-the-analytics-edge-(spring-2015) 512,,,Ubiquant Market Prediction,,Make predictions against future market data,,ubiquant-market-prediction 513,,,Google Brain - Ventilator Pressure Prediction,,"Simulate a ventilator connected to a sedated patient's lung ",MAE,google-brain-ventilator-pressure-prediction 514,,,Shopee - Price Match Guarantee,,Determine if two products are the same by their images,MeanFScore,shopee-price-match-guarantee 515,,,Jigsaw Rate Severity of Toxic Comments ,,Rank relative ratings of toxicity between
comments,JigsawAgreementWithAnnotators,jigsaw-rate-severity-of-toxic-comments--- 516,,,Global Wheat Detection ,,Can you help identify wheat heads using image analysis?,RSNAObjectDetectionAP,global-wheat-detection- 517,,,G-Research Crypto Forecasting ,,Use your ML expertise to predict real crypto market data,,g-research-crypto-forecasting- 518,,,Feedback Prize - Evaluating Student Writing,,Analyze argumentative writing elements from students grade 6-12 ,,feedback-prize-evaluating-student-writing 519,,,TensorFlow - Help Protect the Great Barrier Reef ,,Detect crown-of-thorns starfish in underwater image data,CSIROObjectDetectionFBeta,tensorflow-help-protect-the-great-barrier-reef- 520,,,Tabular Playground Series - Sep 2021,,Practice your ML skills on this approachable dataset!,AUC,tabular-playground-series-sep-2021 521,,,H&M Personalized Fashion Recommendations,,Provide product recommendations based on previous purchases,,h&m-personalized-fashion-recommendations 522,,,Tabular Playground Series - Aug 2021,,Practice your ML skills on this approachable dataset!,RMSE,tabular-playground-series-aug-2021 523,,,The Analytics Edge (15.071x),,Learn what predicts happiness by using informal polling questions.,,the-analytics-edge-(15.071x) 524,,,Corporación Favorita Grocery Sales Forecasting,,Can you accurately predict sales for a large grocery chain?,NWRMSLE,corporación-favorita-grocery-sales-forecasting 525,,,Coleridge Initiative - Show US the Data ,,Discover how data is used for the public good,JaccardFbeta,coleridge-initiative-show-us-the-data- 526,,,Tabular Playground Series - Jan 2022,,Practice your ML skills on this approachable dataset!,SMAPE,tabular-playground-series-jan-2022 527,,,RSNA-MICCAI Brain Tumor Radiogenomic Classification,,Predict the status of a genetic biomarker important for brain cancer treatment,AUC,rsna-miccai-brain-tumor-radiogenomic-classification 528,,,Driver Telematics Analysis,,Use telematic data to identify a driver 
signature,AUC,driver-telematics-analysis 529,,,Sartorius - Cell Instance Segmentation,,Detect single neuronal cells in microscopy images,IntersectionOverUnionObjectSegmentation,sartorius-cell-instance-segmentation 530,,,Tabular Playground Series - Mar 2021,,Practice your ML skills on this approachable dataset!,AUC,tabular-playground-series-mar-2021 531,,,Happywhale - Whale and Dolphin Identification,,Identify whales and dolphins by unique characteristics,,happywhale-whale-and-dolphin-identification 532,,,CareerCon 2019 - Help Navigate Robots ,,Compete to get your resume in front of our sponsors,CategorizationAccuracy,careercon-2019-help-navigate-robots- 533,,,Store Sales - Time Series Forecasting,,Use machine learning to predict grocery sales,,store-sales-time-series-forecasting 534,,,Tabular Playground Series - Nov 2021,,Practice your ML skills on this approachable dataset!,AUC,tabular-playground-series-nov-2021 535,,,Heritage Health Prize,,Identify patients who will be admitted to a hospital within the next year using historical claims data. 
(Enter by 06:59:59 UTC Oct 4 2012) ,RMSLE,heritage-health-prize 536,,,Machinery Tube Pricing,,Model quoted prices for industrial tube assemblies,RMSLE,machinery-tube-pricing 537,,,SIIM-FISABIO-RSNA COVID-19 Detection,,Identify and localize COVID-19 abnormalities on chest radiographs,OpenImagesObjectDetectionAP,siim-fisabio-rsna-covid-19-detection 538,,,Tabular Playground Series - Jul 2021,,Practice your ML skills on this approachable dataset!,MCRMSLE,tabular-playground-series-jul-2021 539,,,Spaceship Titanic,,Predict which passengers are transported to an alternate dimension,,spaceship-titanic 540,,,Tabular Playground Series - Feb 2022,,Practice your ML skills on this approachable dataset!,CategorizationAccuracy,tabular-playground-series-feb-2022 541,,,Tabular Playground Series - Apr 2021,,Synthanic - You're going to need a bigger boat,CategorizationAccuracy,tabular-playground-series-apr-2021 542,,,Africa Soil Property Prediction Challenge ,,Predict physical and chemical properties of soil using spectral measurements,MCRMSE,africa-soil-property-prediction-challenge- 543,,,G2Net Gravitational Wave Detection,,Find gravitational wave signals from binary black hole collisions,AUC,g2net-gravitational-wave-detection 544,,,Tabular Playground Series - Dec 2021,,Practice your ML skills on this approachable dataset!,CategorizationAccuracy,tabular-playground-series-dec-2021 545,,,Lux AI,,Gather the most resources and survive the night!,lux_ai_2021,lux-ai 546,,,Tabular Playground Series - Jun 2021,,Practice your ML skills on this approachable dataset!,MulticlassLoss,tabular-playground-series-jun-2021 547,,,Histopathologic Cancer Detection,,Identify metastatic tissue in histopathologic scans of lymph node sections,AUC,histopathologic-cancer-detection 548,,,Tabular Playground Series - May 2021,,Practice your ML skills on this approachable dataset!,MulticlassLoss,tabular-playground-series-may-2021 549,,,Tabular Playground Series - Oct 2021,,Practice your ML skills on this 
approachable dataset!,AUC,tabular-playground-series-oct-2021 550,,,NBME - Score Clinical Patient Notes,,Identify Key Phrases in Patient Notes from Medical Licensing Exams,,nbme-score-clinical-patient-notes 551,,,Facebook Recruiting IV: Human or Robot?,,Predict if an online bid is made by a machine or a human,AUC,facebook-recruiting-iv:-human-or-robot? 552,,,Tabular Playground Series - Mar 2022,,Practice your ML skills on this approachable dataset!,,tabular-playground-series-mar-2022 553,,,chaii - Hindi and Tamil Question Answering,,Identify the answer to questions found in Indian language passages,Jaccard,chaii-hindi-and-tamil-question-answering 554,,,Lyft Motion Prediction for Autonomous Vehicles,,Build motion prediction models for self-driving vehicles ,PostProcessorKernel,lyft-motion-prediction-for-autonomous-vehicles 555,,,Google Cloud & NCAA® ML Competition 2018-Men's,,Apply Machine Learning to NCAA® March Madness®,LogLoss,google-cloud-&-ncaa®-ml-competition-2018-mens 556,,,March Machine Learning Mania 2022 - Men’s,,Predict the 2022 College Men's Basketball Tournament,,march-machine-learning-mania-2022-men’s 557,,,Bristol-Myers Squibb – Molecular Translation,,Can you translate chemical images to text?,LevenshteinMean,bristol-myers-squibb-–-molecular-translation 558,,,Santa 2021 - The Merry Movie Montage,,Optimize television programming for the winter season,SantasSuperpermutations2021,santa-2021-the-merry-movie-montage 559,,,Google Cloud & NCAA® ML Competition 2019-Men's,,Apply Machine Learning to NCAA® March Madness®,LogLoss,google-cloud-&-ncaa®-ml-competition-2019-mens 560,"'`""There's a thin line between likably old-fashioned and fuddy-duddy, and The Count of Monte Cristo ... never quite settles on either side."" The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee [1]. In their work on sentiment treebanks, Socher et al. 
[2] used Amazon's Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus. This competition presents a chance to benchmark your sentiment-analysis ideas on the Rotten Tomatoes dataset. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others make this task very challenging. Kaggle is hosting this competition for the machine learning community to use for fun and practice. This competition was inspired by the work of Socher et al. [2]. We encourage participants to explore the (dare we say, fantastic) website that accompanies the paper: http://nlp.stanford.edu/sentiment/ There you will find source code, a live demo, and even an online interface to help train the model. [1] B. Pang and L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pages 115-124. [2] Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Chris Manning, Andrew Ng and Chris Potts. Conference on Empirical Methods in Natural Language Processing (EMNLP 2013).`'",tabular data_time series,Sentiment Analysis on Movie Reviews,playground,Classify the sentiment of sentences from the Rotten Tomatoes dataset,CategorizationAccuracy,sentiment-analysis-on-movie-reviews 561,,,MLB Player Digital Engagement Forecasting,,Predict fan engagement with baseball player digital content,MeanColumnwiseMAE,mlb-player-digital-engagement-forecasting 562,"'`Can you help end gender bias in pronoun resolution? Pronoun resolution is part of coreference resolution, the task of pairing an expression to its referring entity. This is an important task for natural language understanding, and the resolution of ambiguous pronouns is a longstanding challenge.
Unfortunately, recent studies have suggested gender bias among state-of-the-art coreference resolvers. Google AI Language aims to improve gender-fairness in modeling by releasing the Gendered Ambiguous Pronouns (GAP) dataset, containing gender-balanced pronouns (50% of its examples containing feminine pronouns, and 50% containing masculine pronouns). In this two-stage competition, Kagglers are challenged to build pronoun resolution systems that perform equally well regardless of pronoun gender. Stage two's final evaluation will use a new dataset following the same format. To encourage gender-fair modeling, the ratio of masculine to feminine examples in the official test data will not be known ahead of time. ---------- Please cite the original paper if you use GAP in your work: @inproceedings{webster2018gap, title = {Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns}, author = {Webster, Kellie and Recasens, Marta and Axelrod, Vera and Baldridge, Jason}, booktitle = {Transactions of the ACL}, year = {2018}, pages = {to appear}, }`'",tabular data_time series,Gendered Pronoun Resolution,research,Pair pronouns to their correct entities,MulticlassLoss,gendered-pronoun-resolution 563,,,NFL Health & Safety - Helmet Assignment,,Segment and label helmets in video footage,NFLHelmetIdentification,nfl-health-&-safety-helmet-assignment 564,,,BirdCLEF 2021 - Birdcall Identification,,"Identify bird calls in soundscape recordings ",MeanFScoreBeta,birdclef-2021-birdcall-identification 565,,,Google Smartphone Decimeter Challenge,,Improve high precision GNSS positioning and navigation accuracy on smartphones,SmartphoneDecimeter,google-smartphone-decimeter-challenge 566,,,SETI Breakthrough Listen - E.T. Signal Search,,Find extraterrestrial signals in data from deep space ,AUC,seti-breakthrough-listen-e.t.-signal-search 567,,,"Ghouls, Goblins, and Ghosts... Boo! 
",,Can you classify monsters haunting Kaggle?,CategorizationAccuracy,"ghouls,-goblins,-and-ghosts...-boo!-" 568,,,March Machine Learning Mania 2021 - NCAAM,,Predict the 2021 NCAAM Basketball Tournament,LogLoss,march-machine-learning-mania-2021-ncaam 569,,,Flavours of Physics: Finding τ → μμμ,,Identify a rare decay phenomenon,CernWeightedAuc,flavours-of-physics:-finding-τ--→--μμμ 570,,,"KDD Cup 2012, Track 1",,Predict which users (or information sources) one user might follow in Tencent Weibo.,MAP@3,"kdd-cup-2012,-track-1" 571,,,March Machine Learning Mania 2022 - Women's,,Predict the 2022 College Women's Basketball Tournament,,march-machine-learning-mania-2022-womens 572,,,Liberty Mutual Group - Fire Peril Loss Cost,,Predict expected fire losses for insurance policies,NormalizedWeightedGini,liberty-mutual-group-fire-peril-loss-cost 573,,,Accelerometer Biometric Competition,,Recognize users of mobile devices from accelerometer data,AUC,accelerometer-biometric-competition 574,,,Cdiscount’s Image Classification Challenge,,Categorize e-commerce photos,CategorizationAccuracy,cdiscount’s-image-classification-challenge 575,,,Plant Pathology 2021 - FGVC8 ,,Identify the category of foliar diseases in apple trees,MeanFScore,plant-pathology-2021-fgvc8- 576,,,PAKDD 2014 - ASUS Malfunctional Components Prediction,,Predict malfunctional components of ASUS notebooks,MAE,pakdd-2014-asus-malfunctional-components-prediction 577,,,KDD Cup 2013 - Author-Paper Identification Challenge (Track 1),,Determine whether an author has written a given paper,MAP@{K}_OLD,kdd-cup-2013-author-paper-identification-challenge-(track-1) 578,,,Google Cloud & NCAA® ML Competition 2018-Women's,,Apply machine learning to NCAA® March Madness®,LogLoss,google-cloud-&-ncaa®-ml-competition-2018-womens 579,,,Google Cloud & NCAA® ML Competition 2019-Women's,,Apply Machine Learning to NCAA® March Madness®,LogLoss,google-cloud-&-ncaa®-ml-competition-2019-womens 580,,,Walmart Recruiting II: Sales in Stormy 
Weather,,Predict how sales of weather-sensitive products are affected by snow and rain,RMSLE,walmart-recruiting-ii:-sales-in-stormy-weather 581,,,U.S. Patent Phrase to Phrase Matching ,,Help Identify Similar Phrases in U.S. Patents,,u.s.-patent-phrase-to-phrase-matching- 582,,,Melbourne University AES/MathWorks/NIH Seizure Prediction,,Predict seizures in long-term human intracranial EEG recordings ,AUC,melbourne-university-aes/mathworks/nih-seizure-prediction 583,,,KDD Cup 2014 - Predicting Excitement at DonorsChoose.org,,Predict funding requests that deserve an A+,AUC,kdd-cup-2014-predicting-excitement-at-donorschoose.org 584,,,March Machine Learning Mania 2021 - NCAAW,,Predict the 2021 NCAAW Basketball Tournament,LogLoss,march-machine-learning-mania-2021-ncaaw 585,,,March Machine Learning Mania 2017,,Predict the 2017 NCAA Basketball Tournament,LogLoss,march-machine-learning-mania-2017 586,,,Helping Santa's Helpers,,"Jingle bells, Santa tells ... ",SantaJobScheduling,helping-santas-helpers 587,,,The Big Data Combine Engineered by BattleFin,,Predict short term movements in stock prices using news and sentiment data provided by RavenPack ,MAE,the-big-data-combine-engineered-by-battlefin 588,,,Facebook Recruiting Competition,,"Show them your talent, not just your resume.",MAP@k,facebook-recruiting-competition 589,,,Avito Context Ad Clicks,,Predict if context ads will earn a user's click ,LogLoss,avito-context-ad-clicks 590,,,Google Landmark Recognition 2021,,"Label famous, and not-so-famous, landmarks in images",GoogleGlobalAP,google-landmark-recognition-2021 591,,,Microsoft Malware Classification Challenge (BIG 2015),,Classify malware into families based on file content and characteristics,MulticlassLossOld,microsoft-malware-classification-challenge-(big-2015) 592,,,Online Product Sales,,Predict the online sales of a consumer product based on a data set of product features.,RMSLE,online-product-sales 593,,,Packing Santa's Sleigh,,"He's making a list, checking it 
twice; to fill up his sleigh, he needs your advice",PackingSantasSleigh,packing-santas-sleigh 594,,,BirdCLEF 2022,,Identify bird calls in soundscapes,,birdclef-2022 595,,,RTA Freeway Travel Time Prediction,,This competition requires participants to predict travel time on Sydney's M4 freeway from past travel time observations.,RMSE,rta-freeway-travel-time-prediction 596,,,Observing Dark Worlds,,Can you find the Dark Matter that dominates our Universe? Winton Capital offers you the chance to unlock the secrets of dark worlds.,DarkWorldsMetric,observing-dark-worlds 597,,,Yelp Recruiting Competition,,"How many ""useful"" votes will a Yelp review receive? Show off your skills to land an interview for a position on a Yelp data mining team!",RMSLE,yelp-recruiting-competition 598,,,ECML/PKDD 15: Taxi Trip Time Prediction (II),,Predict the total travel time of taxi trips based on their initial partial trajectories,RMSLE,ecml/pkdd-15:-taxi-trip-time-prediction-(ii) 599,,,ICDM 2015: Drawbridge Cross-Device Connections,,Identify individual users across their digital devices,MeanFScoreBeta,icdm-2015:-drawbridge-cross-device-connections 600,,,Tabular Playground Series - Apr 2022,,Practice your ML skills on this approachable dataset!,,tabular-playground-series-apr-2022 601,,,MLSP 2014 Schizophrenia Classification Challenge,,Diagnose schizophrenia using multimodal features from MRI scans,AUC,mlsp-2014-schizophrenia-classification-challenge 602,,,The Hunt for Prohibited Content,,Predict which ads contain illicit content,AP@{K},the-hunt-for-prohibited-content 603,,,dunnhumby's Shopper Challenge,,"Going grocery shopping, we all have to do it, some even enjoy it, but can you predict it? dunnhumby is looking to build a model to better predict when supermarket shoppers will next visit the store and how much they will spend. ",PercentCorrectVisits,dunnhumbys-shopper-challenge 604,,,Truly Native? 
,,Predict which web pages served by StumbleUpon are sponsored,AUC,truly-native?- 605,,,DecMeg2014 - Decoding the Human Brain,,Predict visual stimuli from MEG recordings of human brain activity,CategorizationAccuracy,decmeg2014-decoding-the-human-brain 606,,,Cause-effect pairs,,"Given samples from a pair of variables A, B, find whether A is a cause of B.",CauseEffectBidirectionalAuc,cause-effect-pairs 607,,,Don't call me turkey!,,Thanksgiving Edition: Find the turkey in the sound bite,AUC,dont-call-me-turkey! 608,,,Benchmark Bond Trade Price Challenge,,Develop models to accurately predict the trade price of a bond.,WMAE,benchmark-bond-trade-price-challenge 609,,,Google Landmark Retrieval 2021,,"Given an image, can you find all of the same landmarks in a dataset?",MAP@{K},google-landmark-retrieval-2021 610,,,BCI Challenge @ NER 2015,,A spell on you if you cannot detect errors!,AUC,bci-challenge-@-ner-2015 611,,,Don't Overfit!,,"With nearly as many variables as training cases, what are the best techniques to avoid disaster? ",AUC,dont-overfit! 612,,,Partly Sunny with a Chance of Hashtags,,What can a #machine learn from tweets about the #weather?,RMSE,partly-sunny-with-a-chance-of-hashtags 613,,,Chess ratings - Elo versus the Rest of the World,,This competition aims to discover whether other approaches can predict the outcome of chess games more accurately than the workhorse Elo rating system.,RMSE,chess-ratings-elo-versus-the-rest-of-the-world 614,,,March Machine Learning Mania,,Tip off college basketball by predicting the 2014 NCAA Tournament,LogLoss,march-machine-learning-mania 615,,,The Marinexplore and Cornell University Whale Detection Challenge,,"Create an algorithm to detect North Atlantic right whale calls from audio recordings, prevent collisions with shipping traffic",AUC,the-marinexplore-and-cornell-university-whale-detection-challenge 616,,,U.S. 
Census Return Rate Challenge,,Predict census mail return rates.,NWMAE,u.s.-census-return-rate-challenge 617,,,iMaterialist (Fashion) 2019 at FGVC6 ,,Fine-grained segmentation task for fashion and apparel,IntersectionOverUnionObjectSegmentationWithClassification,imaterialist-(fashion)-2019-at-fgvc6- 618,,,What Do You Know?,,Improve the state of the art in student evaluation by predicting whether a student will answer the next test question correctly.,CappedBinomialDeviance,what-do-you-know? 619,,,KDD Cup 2013 - Author Disambiguation Challenge (Track 2),,Identify which authors correspond to the same person,MeanFScore,kdd-cup-2013-author-disambiguation-challenge-(track-2) 620,,,Merck Molecular Activity Challenge,,Help develop safe and effective medicines by predicting molecular activity.,WeightedR2,merck-molecular-activity-challenge 621,"'`This week 2 forecasting task is now closed for submissions. Click here to visit the week 3 version, and make a submission there. This is week 2 of Kaggle's COVID19 forecasting series, following the Week 1 competition. This is the 2nd of at least 4 competitions we plan to launch in this series. Background The White House Office of Science and Technology Policy (OSTP) pulled together a coalition of research groups and companies (including Kaggle) to prepare the COVID-19 Open Research Dataset (CORD-19) to attempt to address key open scientific questions on COVID-19. Those questions are drawn from the National Academies of Sciences, Engineering, and Medicine (NASEM) and the World Health Organization (WHO). The Challenge Kaggle is launching companion COVID-19 forecasting challenges to help answer a subset of the NASEM/WHO questions. While the challenge involves forecasting confirmed cases and fatalities between April 1 and April 30 by region, the primary goal isn't only to produce accurate forecasts. It's also to identify factors that appear to impact the transmission rate of COVID-19.
You are encouraged to pull in, curate and share data sources that might be helpful. If you find variables that look like they impact the transmission rate, please share your findings in a notebook. As the data becomes available, we will update the leaderboard with live results based on data made available from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). We have received support and guidance from health and policy organizations in launching these challenges. We're hopeful the Kaggle community can make valuable contributions to developing a better understanding of factors that impact the transmission of COVID-19. Companies and Organizations There is also a call to action for companies and other organizations: If you have datasets that might be useful, please upload them to Kaggle's dataset platform and reference them in this forum thread. That will make them accessible to those participating in this challenge and a resource to the wider scientific community. Acknowledgements JHU CSSE for making the data available to the public. The White House OSTP for pulling together the key open questions. The image comes from the Centers for Disease Control and Prevention. This is a Code Competition.
Refer to Code Requirements for details.`'",tabular data_time series,COVID19 Global Forecasting (Week 2),research,Forecast daily COVID-19 spread in regions around the world,MCRMSLE,covid19-global-forecasting-(week-2) 622,,,Challenges in Representation Learning: The Black Box Learning Challenge,,"Competitors train a classifier on a dataset that is not human readable, without knowledge of what the data consists of.",CategorizationAccuracy,challenges-in-representation-learning:-the-black-box-learning-challenge 623,,,The Random Number Grand Challenge,,Decode a sequence of pseudorandom numbers,MAE,the-random-number-grand-challenge 624,,,Learning Social Circles in Networks,,Model friend memberships to multiple circles,FacebookCircles,learning-social-circles-in-networks 625,,,Photo Quality Prediction,,"Given anonymized information on thousands of photo albums, predict whether a human evaluator would mark them as 'good'.",CappedBinomialDeviance,photo-quality-prediction 626,,,UPenn and Mayo Clinic's Seizure Detection Challenge,,Detect seizures in intracranial EEG recordings,MCAUC,upenn-and-mayo-clinics-seizure-detection-challenge 627,,,Open Images 2019 - Visual Relationship,,Detect pairs of objects in particular relationships,OpenImagesVisualRelations,open-images-2019-visual-relationship 628,,,Personalized Web Search Challenge,,Re-rank web documents using personal preferences,NDCG@{K},personalized-web-search-challenge 629,,,ICDAR2013 - Gender Prediction from Handwriting,,Predict if a handwritten document has been produced by a male or a female writer,LogLoss,icdar2013-gender-prediction-from-handwriting 630,,,Deloitte/FIDE Chess Rating Challenge,,"This contest, sponsored by professional services firm Deloitte, will find the most accurate system to predict chess outcomes, and FIDE will also bring a top finisher to Athens to present their system",CappedBinomialDeviance,deloitte/fide-chess-rating-challenge 631,,,Hash Code 2021 - Traffic Signaling,,Optimize city traffic in this
extension of the 2021 Hash Code qualifier,PostProcessorKernelDesc,hash-code-2021-traffic-signaling 632,,,The Allen AI Science Challenge,,Is your model smarter than an 8th grader?,CategorizationAccuracy,the-allen-ai-science-challenge 633,,,Belkin Energy Disaggregation Competition,,Disaggregate household energy consumption into individual appliances,BelkinHammingLoss,belkin-energy-disaggregation-competition 634,,,"KDD Cup 2012, Track 2",,Predict the click-through rate of ads given the query and user information.,KddCtrAuc,"kdd-cup-2012,-track-2" 635,,,Text Normalization Challenge - Russian Language,,Convert Russian text from written expressions into spoken forms,CategorizationAccuracy,text-normalization-challenge-russian-language 636,,,AMS 2013-2014 Solar Energy Prediction Contest,,Forecast daily solar energy with an ensemble of weather models,MAE,ams-2013-2014-solar-energy-prediction-contest 637,,,RecSys2013: Yelp Business Rating Prediction,,RecSys Challenge 2013: Yelp business rating prediction,RMSE,recsys2013:-yelp-business-rating-prediction 638,,,Finding Elo,,Predict a chess player's FIDE Elo rating from one game,MAE,finding-elo 639,,,Million Song Dataset Challenge,,Predict which songs a user will listen to.,MAP@k,million-song-dataset-challenge 640,,,JPX Tokyo Stock Exchange Prediction,,Explore the Tokyo market with your data science skills,,jpx-tokyo-stock-exchange-prediction 641,,,INFORMS Data Mining Contest 2010,,The goal of this contest is to predict short term movements in stock prices. 
The winners of this contest will be honoured at the INFORMS Annual Meeting in Austin-Texas (November 7-10).,AUC,informs-data-mining-contest-2010 642,,,Practice Fusion Diabetes Classification,,Identify patients diagnosed with Type 2 Diabetes,LogLoss,practice-fusion-diabetes-classification 643,,,CONNECTOMICS,,Reconstruct the wiring between neurons from fluorescence imaging of neural activity,AUC,connectomics 644,,,EMI Music Data Science Hackathon - July 21st - 24 hours,,Can you predict if a listener will love a new song?,RMSE,emi-music-data-science-hackathon-july-21st-24-hours 645,,,Global Energy Forecasting Competition 2012 - Wind Forecasting,,A wind power forecasting problem: predicting hourly power generation up to 48 hours ahead at 7 wind farms,RMSE,global-energy-forecasting-competition-2012-wind-forecasting 646,,,The ICML 2013 Whale Challenge - Right Whale Redux,,Develop recognition solutions to detect and classify right whales for BIG data mining and exploration studies,AUC,the-icml-2013-whale-challenge-right-whale-redux 647,,,Greek Media Monitoring Multilabel Classification (WISE 2014),,Multi-label classification of printed media articles to topics,MeanFScore,greek-media-monitoring-multilabel-classification-(wise-2014) 648,,,Large Scale Hierarchical Text Classification,,"Classify Wikipedia documents into one of 325,056 categories",MacroFScore,large-scale-hierarchical-text-classification 649,,,IJCNN Social Network Challenge ,,This competition requires participants to predict edges in an online social network.
The winner will receive free registration and the opportunity to present their solution at IJCNN 2011.,AUC,ijcnn-social-network-challenge- 650,,,Algorithmic Trading Challenge,,Develop new models to accurately predict the market response to large trades.,RMSE,algorithmic-trading-challenge 651,,,Psychopathy Prediction Based on Twitter Usage,,Identify people who have a high degree of Psychopathy based on Twitter usage.,MCAP,psychopathy-prediction-based-on-twitter-usage 652,,,Facebook II - Mapping the Internet,,Round II of the Facebook Recruiting Competition. ,AUC,facebook-ii-mapping-the-internet 653,,,EMC Data Science Global Hackathon (Air Quality Prediction),,Build a local early warning systems to accurately predict dangerous levels of air pollutants on an hourly basis.,MAE,emc-data-science-global-hackathon-(air-quality-prediction) 654,,,dunnhumby & hack/reduce Product Launch Challenge,,The success or failure of a new product launch is often evident within the first few weeks of sales. Can you predict a product's destiny? 
,RMSLE,dunnhumby-&-hack/reduce-product-launch-challenge 655,,,Sorghum -100 Cultivar Identification - FGVC 9,,Identify crop varietals,,sorghum--100-cultivar-identification-fgvc-9 656,,,I’m Something of a Painter Myself,,Use GANs to create art - will you be the next Monet?,,i’m-something-of-a-painter-myself 657,,,Wikipedia - Image/Caption Matching,,Retrieve captions based on images,NDCG@{K},wikipedia-image/caption-matching 658,,,Global Energy Forecasting Competition 2012 - Load Forecasting,,A hierarchical load forecasting problem: backcasting and forecasting hourly loads (in kW) for a US utility with 20 zones.,WRMSE,global-energy-forecasting-competition-2012-load-forecasting 659,,,Allstate Claim Prediction Challenge,,A key part of insurance is charging each customer the appropriate price for the risk they represent.,NormalizedGini,allstate-claim-prediction-challenge 660,,,March Machine Learning Mania 2021 - NCAAM - Spread,,Predict the margin of victory in the 2021 men's tournament,RMSE,march-machine-learning-mania-2021-ncaam-spread 661,,,Hotel-ID to Combat Human Trafficking 2021 - FGVC8,,Recognizing hotels to aid Human trafficking investigations,MAP@{K},hotel-id-to-combat-human-trafficking-2021-fgvc8 662,,,Wikipedia's Participation Challenge,,This competition challenges data-mining experts to build a predictive model that predicts the number of edits an editor will make five months from the end date of the training dataset. 
,RMSLE,wikipedias-participation-challenge 663,,,Personality Prediction Based on Twitter Stream,,Identify the best performing model(s) to predict personality traits based on Twitter usage,MCAP,personality-prediction-based-on-twitter-stream 664,,,Billion Word Imputation,,Find and impute missing words in the billion word corpus,LevenshteinMean,billion-word-imputation 665,,,Herbarium 2022 - FGVC9,,Identify plant species of the Americas from herbarium specimens,,herbarium-2022-fgvc9 666,,,EMC Israel Data Science Challenge,,Match source code files to the open source code project,MulticlassLossOld,emc-israel-data-science-challenge 667,,,See Click Predict Fix - Hackathon,,Predict which 311 issues are most important to citizens,RMSLE,see-click-predict-fix-hackathon 668,,,Herbarium 2021 - Half-Earth Challenge - FGVC8,,"Identify plant species of the Americas, Oceania and the Pacific from herbarium specimens",MacroFScore,herbarium-2021-half-earth-challenge-fgvc8 669,,,MLSP 2013 Bird Classification Challenge,,"Predict the set of bird species present in an audio recording, collected in field conditions.",AUC,mlsp-2013-bird-classification-challenge 670,,,The ICML 2013 Bird Challenge,,Identify bird species from continuous audio recordings,AUC,the-icml-2013-bird-challenge 671,,,ImageNet Object Localization Challenge,,Identify the objects in images,ImageNetObjectLocalization,imagenet-object-localization-challenge 672,,,Mapping Dark Matter,,Measure the small distortion in galaxy images caused by dark matter,RMSE,mapping-dark-matter 673,,,March Machine Learning Mania 2021 - NCAAW - Spread,,Predict the margin of victory in the 2021 women's tournament,RMSE,march-machine-learning-mania-2021-ncaaw-spread 674,,,World Cup 2010 - Confidence Challenge,,The Confidence Challenge requires competitors to assign a level of confidence to their World Cup predictions. 
,Custom,world-cup-2010-confidence-challenge 675,,,Flavours of Physics: Finding τ → μμμ (Kernels Only),,Identify a rare decay phenomenon,CernWeightedAuc,flavours-of-physics:-finding-τ--→--μμμ-(kernels-only) 676,,,Western Australia Rental Prices ,,Predict rental prices for properties across Western Australia,,western-australia-rental-prices- 677,,, iNaturalist Challenge at FGVC5,,"Long tailed classification challenge spanning 8,000 species.",MeanBestErrorAtK,-inaturalist-challenge-at-fgvc5 678,,,Kore 2022 - Beta,,Collect the maximum amount of Kore against your opponents,,kore-2022-beta 679,,,R Package Recommendation Engine,,The aim of this competition is to develop a recommendation engine for R libraries (or packages). (R is open-source statistics software.),AUC,r-package-recommendation-engine 680,,,iMaterialist (Fashion) 2020 at FGVC7 ,,Fine-grained segmentation task for fashion and apparel,IntersectionOverUnionObjectSegmentationWithF1,imaterialist-(fashion)-2020-at-fgvc7- 681,,,Tourism Forecasting Part One,,Part one requires competitors to predict 518 tourism-related time series. 
The winner of this competition will be invited to contribute a discussion paper to the International Journal of Forecasting.,Custom,tourism-forecasting-part-one 682,,,Multi-modal Gesture Recognition,,Recognize gesture sequences in video and depth data from Kinect,GestureNormalizedLevenshteinMean,multi-modal-gesture-recognition 683,,,Flu Forecasting,,"Predict when, where and how strong the flu will be",,flu-forecasting 684,,,Just the Basics - Strata 2013,,"Live from Santa Clara, CA - Core Data Science Skills with Kaggle’s Top Competitors",AUC,just-the-basics-strata-2013 685,,,CHALEARN Gesture Challenge,,Develop a Gesture Recognizer for Microsoft Kinect (TM),GestureNormalizedLevenshteinMean,chalearn-gesture-challenge 686,,,Image Matching Challenge 2022,,Register two images from different viewpoints,,image-matching-challenge-2022 687,,,Eye Movements Verification and Identification Competition,,Determine how people may be identified based on their eye movement characteristics.,MulticlassLossOld,eye-movements-verification-and-identification-competition 688,,,Risky Business,,Predict the risk of customer credit default,,risky-business 689,,,ICFHR 2012 - Arabic Writer Identification,,Identify which writer wrote which documents.,CategorizationAccuracy,icfhr-2012-arabic-writer-identification 690,,,Tourism Forecasting Part Two,,Part two requires competitors to predict 793 tourism-related time series. 
The winner of this competition will be invited to contribute a discussion paper to the International Journal of Forecasting.,Custom,tourism-forecasting-part-two 691,,,iWildcam 2021 - FGVC8,,Count the number of animals of each species present in a sequence of images,MCRMSE,iwildcam-2021-fgvc8 692,,,ICDAR2013 - Handwriting Stroke Recovery from Offline Data,,Predict the trajectory of a handwritten signature,RMSE,icdar2013-handwriting-stroke-recovery-from-offline-data 693,,,Cervical Cancer Screening,,Help prevent cervical cancer by identifying at-risk populations,,cervical-cancer-screening 694,,,As the World Churns,,Predict which customers will leave an insurance company in the next 12 months.,,as-the-world-churns 695,,,ICDAR 2011 - Arabic Writer Identification,,This competition requires participants to develop an algorithm to identify who wrote which documents. The winner will be honored at a special session of the ICDAR 2011 conference. ,MAE,icdar-2011-arabic-writer-identification 696,,,CHALEARN Gesture Challenge 2,,Develop a Gesture Recognizer for Microsoft Kinect (TM),GestureNormalizedLevenshteinMean,chalearn-gesture-challenge-2 697,,,GeoLifeCLEF 2022 - LifeCLEF 2022 x FGVC9,,Location-based species presence prediction,,geolifeclef-2022-lifeclef-2022-x-fgvc9 698,,,CPROD1: Consumer PRODucts contest #1,,Identify product mentions within a largely user-generated web-based corpus and disambiguate the mentions against a large product catalog.,MeanFScoreVariant,cprod1:-consumer-products-contest-#1 699,,,iMaterialist Challenge at FGVC 2017,,Can you assign accurate description labels to images of apparel products?,MeanBestErrorAtK,imaterialist-challenge-at-fgvc-2017 700,,,Raising Money to Fund an Organizational Mission,,Help worthy organizations more efficiently target and recruit loyal donors to support their causes. 
,AverageAmongTopP,raising-money-to-fund-an-organizational-mission 701,,,Semi-Supervised Feature Learning,,There's been a lot of recent work done in unsupervised feature learning for classification and there are a ton of older methods that also work well. The purpose of this competition is to find out which of these methods work best on relatively large-scale high dimensional learning tasks.,AUC,semi-supervised-feature-learning 702,,,Data Mining Hackathon on BIG DATA (7GB) Best Buy mobile web site,,Predict which BestBuy product a mobile web visitor will be most interested in based on their search query or behavior over 2 years (7 GB).,MAP@k,data-mining-hackathon-on-big-data-(7gb)-best-buy-mobile-web-site 703,,,Challenges in Representation Learning: Multi-modal Learning,,The multi-modal learning challenge,AUC,challenges-in-representation-learning:-multi-modal-learning 704,,,Forecast Eurovision Voting ,,"This competition requires contestants to forecast the voting for this year's Eurovision Song Contest in Norway on May 25th, 27th and 29th. ",AE,forecast-eurovision-voting- 705,,,Boston Data Festival Hackathon,,Can you make a better prediction than a monkey with a dart?,,boston-data-festival-hackathon 706,,,Hotel-ID to Combat Human Trafficking 2022 - FGVC9,,Recognizing hotels to aid Human trafficking investigations,,hotel-id-to-combat-human-trafficking-2022-fgvc9 707,,,Will I Stay or Will I Go?,,Predict which of our current customers will stay insured with us for an entire policy term. ,,will-i-stay-or-will-i-go? 
708,,,Prescription Volume Prediction,,Predict future prescription volume,,prescription-volume-prediction 709,,,MasterCard - Data Cleansing Competition,,Improve the quality of information within transaction data,,mastercard-data-cleansing-competition 710,,,iWildCam 2022 - FGVC9,,Count the number of animals in a sequence of images,,iwildcam-2022-fgvc9 711,,,Visualize the State of Public Education in Colorado,,"Using 3 years of school grading data supplied by the Colorado Department of Education and R-Squared Research, visually uncover trends in the Colorado public school system.",RMSE,visualize-the-state-of-public-education-in-colorado 712,,,Leaping Leaderboard Leapfrogs,,Provide creative visualizations of the Kaggle leaderboard ,RMSE,leaping-leaderboard-leapfrogs 713,,,Predicting Parkinson's Disease Progression with Smartphone Data,,Can we objectively measure the symptoms of Parkinson’s disease with a smartphone? We have the data to find out!,RMSE,predicting-parkinsons-disease-progression-with-smartphone-data 714,,,Introducing Kaggle Scripts,,Your code deserves better,RMSE,introducing-kaggle-scripts 715,,,Google Cloud & NCAA® March Madness Analytics,,Uncover the madness of March Madness®,,google-cloud-&-ncaa®-march-madness-analytics 716,,,Acea Smart Water Analytics ,,"Can you help preserve ""blue gold"" using data to predict water availability?",,acea-smart-water-analytics- 717,,,LearnPlatform COVID-19 Impact on Digital Learning,,Use digital learning data to analyze the impact of COVID-19 on student learning,,learnplatform-covid-19-impact-on-digital-learning 718,,,NFL Big Data Bowl 2022,,Help evaluate special teams performance,,nfl-big-data-bowl-2022 719,,,2021 Kaggle Machine Learning & Data Science Survey,,The most comprehensive dataset available on the state of ML and data science,,2021-kaggle-machine-learning-&-data-science-survey 720,,,Excellence in Research Award (Phase II),,WiDS Datathon Further Examines the Impacts of Climate 
Change,,excellence-in-research-award-(phase-ii) 721,,,World Cup 2010 - Take on the Quants,,Quants at Goldman Sachs and JP Morgan have modeled the likely outcomes of the 2010 World Cup. Can you do better?,Custom,world-cup-2010-take-on-the-quants 722,,,Getting Started,,Create a forum for New Users,RMSE,getting-started 723,,,Google Cloud & NCAA® ML Competition 2020-NCAAM,,Apply Machine Learning to NCAA® March Madness®,LogLoss,google-cloud-&-ncaa®-ml-competition-2020-ncaam 724,,,Google Cloud & NCAA® ML Competition 2020-NCAAW,,Apply Machine Learning to NCAA® March Madness®,LogLoss,google-cloud-&-ncaa®-ml-competition-2020-ncaaw 725,"'`Click here for the OED19classification Kaggle Competition. This page serves as secondary material for the presentation of Kaggle competitions at Open Education Day 2019. It is a fictitious regression task. In train.csv, some function value pairs (xi,yi) are given. The task is to produce a prediction for the x-values in test.csv. The data are entirely made up. Imagine the measurements are a measure of the algae pollution in your private pond. Given the measurements from the first 20 summer days, how will the algae pollution increase over the next 20 days? Make a prediction for the time points given in Xtest!
`'",tabular data,Einführung in Kaggle InClass Competitions,,,mse,einfhrung-in-kaggle-inclass-competitions 726,"'`Based on the Bike Sharing Demand dataset: the period from 2011 through June 2012 is the training data, and July through December 2012 is the test data; the test data is split 50%/50% into public and private leaderboards. Reference notebooks: Base_submission, DR_submission (AutoML), Best_submission.`'",tabular data,Bike Sharing Demand for Education(),,,rmsle,bike-sharing-demand-for-education() 727,"'`Context Pneumonia is an infection that causes inflammation in the lungs. It can be caused by viruses, bacteria, fungi, etc. Radiologists check chest X-ray images to look for white spots in the lungs, called infiltrates, that identify an infection. Content The original dataset is in Kaggle datasets. This dataset was created by modifying ""Chest X-Ray Images (Pneumonia)"" (licensed under CC BY 4.0). This dataset has 5,528 X-Ray images (JPEG) and 2 categories (Pneumonia/Normal). Training Data and Test Data The training data is 3,869 (70%) randomly chosen images: 2,991 represent pneumonia and 878 are normal. The test data is the remaining 1,659 (30%) images: 1,282 represent pneumonia and 377 are normal. Reference Paul Mooney: Chest X-Ray Images (Pneumonia), Kaggle datasets Daniel Kermany, Kang Zhang, Michael Goldbaum: Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification, Mendeley Radiological Society of North America, Inc.: Pneumonia, RadiologyInfo.org for patients`'",image data,Pneumonia Diagnosis,inClass,The 9th 1056Lab Data Analytics Competition (Extra),auc,pneumonia-diagnosis 729,"'`This is the first in-class Kaggle competition for ACM ML SMP Summer '19. The goal is to learn how to go about a Kaggle competition and make the process seem less daunting. Please go through the starter kernel. 
Feel free to explore other kernels for ideas and hacks that could improve your score on the leaderboard. By the end of this competition we hope you learn the importance of cross-validation schemes, and experiment with feature engineering, feature scaling, exploratory data analysis, hyperparameter tuning, feature encoding, and maybe even other machine learning models such as XGBoost and LightGBM. The goal is to learn as much as possible.`'",tabular data,ACM Summer'19 Inclass-1,inClass,Here is a Kaggle competition using a Graduate Admission dataset,rmse,acm-summer19-inclass-1 730,'`Homework competition with cars classification`',image data,car-classification,inClass,Car classification competition,categorizationaccuracy,car-classification 731,"'`This problem challenges you to develop a model that detects sarcasm in text. A dataset with 26,709 samples has been made available, of which 18,696 are for testing.`'",text data,Sarcasmo,inClass,20642 - Machine Learning - Universidade de Brasília,auc,sarcasmo 732,"'`You like Molson's, eh?`'",tabular data,compass-canada,inClass,"if you know, you know...eh",rmse,compass-canada 733,"'`Lab 1: NLP, multilabel classification. Pipeline: feature extraction with gensim, fasttext, Glove, BMEmb, spacy, nltk, and sklearn.preprocessing.`'",text data,Bad comments,inClass,Competition for Lab Assignment No. 1,mcauc,bad-comments 734,"'`Hello DSA student! Welcome to the February 2019 edition of the DSA Machine Learning Competition. This is an excellent opportunity for you to practice everything you learned in the Data Science Academy courses, develop your problem-solving skills, and even win prizes. Your objective in this competition is to explore the NYC Benchmarking dataset, which measures 60 variables related to energy use for more than 11,000 buildings in New York City. You must build a predictive model capable of predicting the Energy Star Score, which is frequently used as an aggregate measure of a building's overall efficiency. 
The Energy Star Score is a percentile measure of a building's energy performance, calculated from its energy use. More details on the ""Definindo o Problema"" (Defining the Problem) page in this competition. Prizes for the top 3 finishers: 1. First place - 24"" Samsung monitor 2. Second place - Raspberry Pi 3 3. Third place - Gift card for the Livraria Saraiva bookstore worth R$100 Organization: Data Science Academy www.datascienceacademy.com.br Good luck and happy studying! The DSA Team`'",tabular data,Competição DSA de Machine Learning,,,mae,competio-dsa-de-machine-learning 735,"'`Introduction The ""2019 ML competition"" is hosted by KISTI for academic purposes. Competition background: the RMS Titanic sank on 15 April 1912; of the 2,224 passengers and crew aboard, 1,502 died. Acknowledgement KISTI`'",tabular data,2019 ML competition with KISTI,,,categorizationaccuracy,2019-ml-competition-with-kisti 736,"'`Context Japan has four distinct seasons. About 1,300 AMeDAS (Automated Meteorological Data Acquisition System) stations automatically record weather data such as temperature, sunshine, and windspeed. Temperature is seasonal time-series data: it contains a seasonality component, and there may be a trend caused by climate change. Content This dataset is about minimum temperature forecasting. This dataset was acquired from the Japan Meteorological Agency. The data contains 4,018 consecutive instances (days); the first 3,653 instances are for training, the last 365 instances are for evaluation. This competition is inspired by the Daily Minimum Temperatures in Melbourne dataset in Kaggle. Training Data and Test Data The training data is 10 years (3,653 days) of consecutive data that includes minimum air temperature, maximum air temperature, sunshine, rainfall, and mean windspeed. The test data is the subsequent 365 days. You can also use maximum air temperature, sunshine, rainfall, and mean windspeed in the training data, but the test data does not contain them. 
Reference Japan Meteorological Agency Paul Brabban: Daily Minimum Temperatures in Melbourne, Kaggle Datasets`'",tabular data,Temperature Forecasting,inClass,The 10th 1056Lab Data Analytics Competition,rmse,temperature-forecasting 737,"'`Assignment statement (Sentiment analysis) The assignment consists of finishing the implementation of a naive predictor of the polarity of a movie review taken from the English IMDB. The model is a traditional logistic regression model, but implemented with pytorch. The dataset was downloaded from the following site: http://ai.stanford.edu/~amaas/data/sentiment/ The assignment consists of filling in the gaps left in the ""Point de départ"" (starting point) kernel: Fork the ""Point de départ"" kernel. Implement the loading of the test data in the notebook, in the cell marked for that purpose. Add a run_test() method to the SentimentAnalyser class so that it produces a test set stored in a csv file. Submit the tests generated this way to the competition. Extend the train() method of the SentimentAnalyser class to split the training dataset into two parts, training and validation (see torch.utils.data.random_split). The body of the method will be modified to (1) perform an evaluation on the validation data at each epoch and (2) finally save the model that minimizes the loss on this validation data. Do a hyper-parameter search (number of epochs and learning rate). The expected deliverable is the ""Point de départ"" notebook, completed. Take advantage of the notebook framework to comment on your answers if needed. 
Grading scale Indicative grading elements: Code quality (1) Having submitted at least one valid test set to the competition (1) Having submitted a test set with a result clearly above the baseline (2) Having submitted a test set with a result above the professor's solution (1)`'",text data,AS-bow-2019-2020,inClass,Sentiment analysis with bags of words (2019-2020),categorizationaccuracy,as-bow-2019-2020 738,"'`This Kaggle challenge is your last evaluation in the SMEMI309 class (Computational Statistical Methods, by Prof. Jean Martinet) in 2020. The challenge will be held between Nov 19 (Thursday) and December 10 (Thursday) at 20.00. Time is in UTC. NEW: teams of 3 students max allowed (not mandatory). Context The objective is to classify data generated by a Spiking Neural Network (SNN). The general approach is described in this paper. Warning This challenge contains real, unexplored, freshly generated experimental data from the on-going research of our lab. Students may or may not find useful results during the challenge. Inversely, the data may or may not be too easy to process. This makes the challenge realistic, exciting, and challenging. This also makes your work particularly important, since your results might be useful for a research group. The main point is to find relevant answers, by any means. You are free to use any technique learnt in the class or elsewhere. Experimental settings and data description A webcam is plugged into an event-camera simulator and shown several stimuli, in the form of object motion in four directions (up, down, left, right). The event-camera simulator is a piece of software written to simulate an event-camera. The event data is fed to a one-layer SNN with 10 output neurons, and the output activity is recorded for a fixed short duration. One sample output vector is made of the output neuron spike code during this duration, i.e. 10 integer values. We will study three different neural encodings. 
Rate coding (output neuron spike count - onsc): The code is made of the total count of spikes during the stimulation. Temporal latency coding (output neuron spike temporal::first - onstf): The code, also called time-to-first-spike, is made of the latency of the first spike, i.e. the time between stimulus and the first spike. Temporal rank-order coding (output neuron spike temporal::order - onsto): The code is made of the order of arrival of first spikes. The datasets are obtained by presenting an object (a hand) moving in translation before the webcam with a fixed speed, in one of the four directions. This has been repeated 80 times for each class. The original 320 samples have been rotated by 90, 180, and 270 degrees for augmentation. Moreover, this data augmentation removes any bias related to the video acquisition, since each sequence now belongs to all four classes. The resulting 1280 event sequences are fed to the network, and we obtain 320 10-D vectors, that are split into train / test sets. Classes are balanced in all sets. Class numbers are 0 for up, 1 for down, 2 for left, and 3 for right. There are three datasets: BEFORE_TRAINING (BT) obtained on a randomly initialised SNN, AFTER_TRAINING with 2 classes (AT2) obtained with the same network (random init.), after a short STDP-based unsupervised training using just two classes (for you to find out!), and AFTER_TRAINING with 4 classes (AT4), obtained with the same network (trained with 2 classes), after a short STDP-based unsupervised training using all four classes. The purpose of separating the AT2 and AT4 datasets is to verify if the network is able to successfully learn 2 classes, and then to successfully learn 4 classes, with continuous adaptation to a varying number of classes (continuous learning). The main dataset is AT4. This means that you need to use AT4_train to train, and AT4_test to generate your submission file. Note that all test datasets are generated using four classes, even AT2. 
All three encoding schemes are in single files in the following order: label, 10 columns for count, 10 columns for temporal::first, 10 columns for temporal::order BT_train.csv, AT2_train.csv and AT4_train.csv (1280 rows, 31 columns) label, onsc1, onsc2, ..., onsc10, onstf1, onstf2, ..., onstf10, onso1, onso2, ..., onso10 1,4,4, ..., 8,0.6002,0.6002, ..., 0.5202,6,6, ..., 0 1,0,0, ..., 6,3.01,3.01, ..., 0.6402,-1,-1, ..., 1 0,2,3, ..., 5,0.7602,0.8801999999999999, ..., 0,0, ..., 0 etc. Important notes: ""-1"" means that the neuron did not spike; the arbitrary number of digits gives you full precision, but digits were rounded to evaluate rank orders. BT_test.csv, AT2_test.csv and AT4_test.csv (320 rows, 30 columns because of course, no label!) onsc1, onsc2, ..., onsc10, onstf1, onstf2, ..., onstf10, onso1, onso2, ..., onso10 3,3, ..., 9,0.6402,0.6402, ..., 0.6002,4,4, ..., 0 3,2, ..., 4,0.8801999999999999,0.8801999999999999, ..., 0.8402,4,4, ..., 0 4,3, ..., 4,1.1201999999999999,1.1201999999999999, ..., 1.1201999999999999,1,1, ..., 1 etc. sampleSubmission.csv (320 rows, 2 columns) id, label 0, 0 1, 0 2, 0 3, 0 4, 0 etc. Expected work and results Four parts are expected (the analysis should be done for ALL 3 encoding schemes): A statistical analysis to determine whether the data in BT_train.csv, AT2_train.csv, and AT4_train.csv significantly differ from one another. Of course, the hypothesis under test is that the data differ. The main task in this challenge is the training of a model based on AT4_train.csv to correctly classify AT4_test.csv. You are expected to work on a Kaggle notebook and to post your results frequently (2 submissions per day allowed). Another task (not to be posted) consists in training a model on BT_train.csv to correctly classify BT_test.csv. Do not try too hard to get good results with this dataset, since we expect (and even hope) this result to be bad. The last task consists in checking the continuous learning capability of the network. Please give explicit section names in your report. 
In your conclusion, explicitly state which neural coding enables the best classification. Grade The grade will take into account (subject to modifications): (4/10 points) a synthetic description of your solution, including the statistical analysis and detailed explanation of your choices, and why you believe they are good (4-page max PDF; PDF, not ipynb). The report is to be submitted on LMS (4/10 points) the last submitted solution for AT4_test (2/10 points) the final challenge ranking (max for first) Note that the ranking formula is coef x [ 1 - (rank-1) / (nbTeams-1) ]`'",tabular data,SMEMI309 - Final evaluation challenge 2020,inClass,Classify SNN output data by all possible means,categorizationaccuracy,smemi309-final-evaluation-challenge-2020 739,"'`This third assignment picks up from your second assignment, where you explored a dataset. Now, you will build a classifier to predict the QuoteConversionFlag of the Insurance Marketing Data. The classification goal is to predict whether the customer will buy the insurance or not (target attribute: QuoteConversionFlag (binary: '0', '1')). Important: only users with their UTS student ID as a username will be eligible for the class prize.`'",tabular data,2019S UTS Data Analytics Assignment 3,inClass,Predict whether the customer will buy the insurance or not !!,auc,2019s-uts-data-analytics-assignment-3 740,'`A small regression problem.`',image data,Predict the missing pixel value v2,inClass,A simple regression task derived from the image of my cat.,mse,predict-the-missing-pixel-value-v2 741,"'`Don't Overfit Images! Prove your natural intelligence by building artificial intelligence. The last decade was the golden age of machine learning with the increase in datasets and computation power. Deep learning is solving a wide variety of problems with the large datasets available. 
However, there are many problems that don't have large datasets. When trained on these datasets, neural networks tend to overfit and perform badly on test data. Can you prove that machine learning can predict accurately, with less overfitting, on small datasets too? Your task Given 500 brain scan images and class labels, you need to identify the type of brain tumor present in the 2,564 images in the test set. This is a multiclass image classification problem where you are required to classify each image into one of the 3 class labels. See the Data section for more information about the dataset. The evaluation metric used is Mean F1-Score. Check the Evaluation section for more details on it. Click here for guidelines on how to make a submission through a kernel. During the contest, the private leaderboard will show scores on only 50% of the test data. The public leaderboard, which displays scores on the complete test data, will only be available once the contest ends. One proves one's expertise in machine learning only when one masters the art of not overfitting! All the best!`'",image data,Anokha AI Adept,inClass,"This machine learning contest is part of Anokha, a national level TechFest",meanfscore,anokha-ai-adept 742,'` : ?`',image data, ?,,,auc,--? 743,"'`Context This dataset is about diabetes, a lifelong condition that causes a person's blood sugar level to become too high [NHS]. There are 2 main types of diabetes: type 1 diabetes, where the body's immune system attacks and destroys the cells that produce insulin, and type 2 diabetes, where the body does not produce enough insulin or the body's cells do not react to insulin. Type 2 diabetes is far more common than type 1. In the UK, around 90% of all adults with diabetes have type 2. The key figures about diabetes are [IDF]: 1 in 11 adults (20-79 years) have diabetes (463 million people). 1 in 2 adults with diabetes are undiagnosed (232 million people). 1 in 5 people with diabetes are above 65 years old (136 million people). 
10% of global health expenditure is spent on diabetes (USD 760 billion). 1 in 6 live births (20 million) is affected by hyperglycemia in pregnancy, 84% of which have gestational diabetes. 3 in 4 (79%) people with diabetes live in low- and middle-income countries. Over 1.1 million children and adolescents below 20 years have type 1 diabetes. 1 in 13 adults (20-79 years) have impaired glucose tolerance (374 million people). 2 in 3 people with diabetes live in urban areas (310.3 million). Content Original data came from the Biostatistics program at Vanderbilt. This dataset was downloaded from data.world. The downloaded dataset contains 390 instances (patients), of which 60 are diabetes and the remaining 330 are not diabetes. Any patient without a Hemoglobin A1c was excluded. If their Hemoglobin A1c was 6.5 or greater they were labeled with diabetes = 1. The downloaded dataset contains Glucose but it was deleted from this dataset because it is strongly related to diabetes. Training Data and Test Data The training data is 273 (70%) randomly chosen instances. The test data is the remaining 117 (30%) instances. Notice This is a private competition in 1056Lab, a data mining laboratory at Chubu University. Reference Diabetes Prediction - data.world Diabetes - The National Health Service (NHS), UK IDF DIABETES ATLAS, 9th edition 2019 - International Diabetes Federation`'",tabular data,Diabetes Diagnosis,inClass,The 17th 1056Lab Data Analytics Competition,auc,diabetes-diagnosis 744,"'`The goal of this competition is to predict the star rating associated with user reviews from Amazon Movie Reviews using the available features. You are allowed to use any technique used in class for your predictions, as well as classical machine learning algorithms, like random forests, regression trees, etc. Using deep learning models, or any other related technique, that lies far from the syllabus of this class is prohibited. 
What we mainly seek from this competition, besides performance, is smart ways to make sense of the data, construct new features from the available metadata, and understand your thought procedure. In addition to submitting your solution online, you need to provide us with a 2-page writeup that describes the algorithm you have implemented and the special tricks you used in order to make it work (or improve). It is important that you show us your thought procedure. Also, describe your strategy for selecting that particular algorithm and how you did your offline evaluation. Note that some sort of offline evaluation is required. Your writeup should not exceed 2 pages under any circumstance, else it will not be graded. Your grade will be determined based on your performance at the Kaggle competition, as well as from a report that you need to send us at cs506kaggle@gmail.com. Please email your Kaggle username to cs506kaggle@gmail.com and clearly mention it on your project writeup as well. Have fun and good luck!!`'",text data,BU CS506 Spring 2020 Midterm,inClass,The goal of this competition is to predict star ratings using the Amazon Movie Reviews dataset.,rmse,bu-cs506-spring-2020-midterm 745,"'` . , , : , , . . , . (), . . () . , .`'",image data, (level 1),,,dice,-----(level-1) 746,"'`Context What determines the price of used cars? The value of a car drops right from the moment it is bought and the depreciation continues with each passing year. In fact, in the first year itself, the value of a car decreases by 20 percent of its initial value. The make and model of a car, total kilometers driven, overall condition of the vehicle and various other factors further affect the car's resale value [CarDekho]. Content The original dataset is in Kaggle Datasets. This data contains 6,019 instances (cars), including 3,205 diesel cars, 2,746 petrol cars, 56 CNG cars, 10 LPG cars, and 2 electric cars (probably) in India. 
Training Data and Test Data The training data is 4,213 (70%) randomly chosen instances. The test data is the remaining 1,806 (30%) instances. Notice This is a private competition in 1056Lab, a data mining laboratory at Chubu University. Reference Avi Kasliwal: User Cars Price Prediction - Kaggle Datasets Frequently Asked Questions - CarDekho`'",tabular data,Used Cars Price Prediction,inClass,The 18th 1056Lab Data Analytics Competition,rmsle,used-cars-price-prediction 747,"'`The goal of this competition is to build the best model you can for classifying 28x28 grayscale glyphs of the letters A - J. This dataset is similar to the MNIST dataset, but the classification task will be more challenging. Each submission you make will be scored based on its classification accuracy on a held-out test dataset. Results from scoring your model on 50% of the test dataset will be used during the competition to rank your submission (and accordingly, your model) on the public leaderboard. After the end of the competition, your model will be scored again on the remaining 50% of the test dataset, and these scores will be used to determine the final rankings. In this way, we verify that your model can generalize to unseen data. You can do your work directly on Kaggle using the Notebooks feature. This will give you access to GPUs for faster computation as well. This is a friendly competition for educational purposes and bragging rights; as an added incentive, the team with the highest-ranked model after the close of the competition will earn a 10% bonus on their project grade. 
See myCourses for details of project grading.`'",image data,COMP 750/850 Project 1,inClass,Develop machine learning models to classify notMNIST images.,categorizationaccuracy,comp-750/850-project-1 748,"'`Prediction of RNA Binding Sites in a Protein CONTEST OVERVIEW: This contest is going to be a beginner-level Kaggle challenge, and this is a great opportunity to take a step forward if you are a beginner in Machine Learning and Data Science. All you need is basic Python knowledge and a zeal to learn. If you devote the required amount of time and effort, by the end of this contest you'll realise that you've grown a lot. PROBLEM OVERVIEW: Proteins are the molecular workhorses in living organisms. They perform a broad range of essential functions. They catalyze metabolic reactions, replicate DNA, respond to stimuli, provide movement, and much more. So in this challenge, we will look at a specific type of protein, RBPs (RNA Binding Proteins), with a specific function, i.e. to bind to a specific target site on RNA. RBPs bind to specific target sites; how they identify these sites is still under research. Understanding how these proteins bind to a specific RNA is of great importance as it can provide us with another way to identify target proteins with similar functions. So the identification of RBPs and their binding sites is a major challenge in the field of molecular recognition. Determining the RNA-interacting residues in a protein from its structure is quite easy, but it is time-consuming and costly. So we want to develop a machine learning model to predict RNA binding sites of a protein from its amino acid sequence. Topics you may want to read on your own: Basic knowledge about proteins. Scoring Matrices EVALUATION: Participants will be evaluated based on the Area Under Curve (AUC) of ROC for their Kaggle Submissions. Plagiarism will not be tolerated and will result in disqualification. NOTE: BEWARE OF OVERFITTING!! 
REFERENCES: Hint related to how to build an appropriate input matrix - To capture the effects caused by evolution in the amino acid sequence, a PSSM matrix can be used. (https://www.cs.rice.edu/~ogilvie/comp571/2018/09/11/pssm.html) A good Applied Machine Learning course for Beginners: https://www.udacity.com/course/intro-to-machine-learning--ud120 Wondering how to apply classification techniques to your input data? You can refer to https://towardsdatascience.com/building-classification-models-with-sklearn-6a8fd107f0c1 . `'",tabular data,The Kaggle Master,,Prediction of RNA Binding Sites in a Protein,auc,the-kaggle-master 749,"'`AILAB ML Training #1 Training objectives: Understand the public and private test system. Try validation, especially CV. Try data augmentation. Try ensembling. Apply knowledge from past competitions to this ongoing competition.`'",image data,AILAB ML Training #1,inClass,kuzushiji MNIST (KMNIST) Classification,categorizationaccuracy,ailab-ml-training-#1 750,"'` YUMNIST () `'",image data,AILAB ML Training #0,inClass,MNIST Classification,categorizationaccuracy,ailab-ml-training-#0 751,"'`This ""code"" competition invites you to train a system that will mimic an unknown function with 8 variables: f(a,b,c,d,e,f,g,h) = ? Each variable is an integer in the range (1, 20) inclusive. The output is an integer which will always be within the range (1, 2000) (but the actual range of values is likely smaller). You get 500 samples to train on and need to predict 500 new values. Extra Details I wrote a simple function that takes 8 inputs and generates an output value by adding 8 independent mini-functions and then rounding the result to an integer. So in reality it looks something like this: f(a,b,c,d,e,f,g,h) = round(fn2(a,f) + fn2(b) + fn3(c,d,e) + ...) Not every term is weighted equivalently. Some terms are simple and heavily weighted. Some terms are complex and less heavily weighted. 
This should allow many systems to find a relatively good answer relatively easily (but it might take some extra work to get a score of 0). Operators The operators used in the functions are operations that would be considered basic: addition, subtraction, multiplication, division, plus some basic built-in operators like sqrt, sin, cos, etc.`'",tabular data,Basic Regression Competition,inClass,Find the best fit for a non-linear regression problem with 8 input variables and one output variable,mae,basic-regression-competition 752,"'`Context This dataset is about brain cancer gene expression. Brain cancers are malignant brain tumors, growths of abnormal cells that have formed in the brain [ABTA]. Gene expression profiling is a laboratory method that identifies all of the genes in a cell or tissue that are making messenger RNA. Messenger RNA molecules carry the genetic information that is needed to make proteins from the DNA in the nucleus of the cell to the cytoplasm where the proteins are made. A gene expression profile may be used to find and diagnose a disease or condition or to see how well the body responds to treatment [NCI]. The dataset contains 4 types of brain tumors: ependymoma, glioblastoma, medulloblastoma, and pilocytic astrocytoma. Content Original data came from CuMiDa: An Extensively Curated Microarray Database via Kaggle Datasets. The downloaded dataset contains 130 instances (patients): 46 ependymoma, 34 glioblastoma, 22 medulloblastoma, 15 pilocytic astrocytoma, and the remaining 13 normal. Training Data and Test Data The training data is randomly chosen 91 (70%) instances. The test data is the remaining 39 (30%) instances. License ODbL 1.0 in Kaggle Datasets. Notice This is a private competition in 1056Lab, a data mining laboratory at Chubu University. 
References Bruno Grisci: Brain cancer gene expression - CuMiDa - Kaggle Datasets CuMiDa: An Extensively Curated Microarray Database - SBCB Feltes, B.C.; Chandelier, E.B.; Grisci, B.I.; Dorn, M. CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research. Journal of Computational Biology, Ahead of Print, 2019. Brain Tumor Education - American Brain Tumor Association NCI Dictionary of Cancer Terms - National Cancer Institute`'",tabular data,Brain Cancer Classification,inClass,,meanfscore,brain-cancer-classification 753,"'` . . , . , . , , -> . ! . . , 3 ( ), , , 4, . RocAuc. . . , """" ( ). . ML .`'",tabular data,[SF-DST] Recommendation Challenge v4,,,auc,[sf-dst]-recommendation-challenge-v4 754,"'`Context This dataset is about the sinking of the Titanic on April 15, 1912. Titanic had approximately 1,300 passengers and 900 crew members, but more than 1,500 people were killed in the accident. The accident revealed many problems: the lack of lifeboats, the treatment of passenger classes during the evacuation, and the regulations of the time. For example, there were only 20 lifeboats, which could carry only 1,178 people. In fact, only 705 people were rescued in lifeboats. The goal of this task is to estimate the survivors of this accident from the passengers' list. Content Original data came from CS109: Intro to Probability for Computer Scientists at Stanford University, Spring 2016. Thanks also go to Kaggle and Encyclopedia Titanica for the dataset. The data contains 887 instances (passengers). Training Data and Test Data The training data is randomly chosen 621 (70%) instances. The test data is the remaining 266 (30%) instances. Notice This is a private competition in MPRG and 1056Lab at Chubu University. 
References CS109: Intro to Probability for Computer Scientists - Stanford University Titanic - Encyclopedia Britannica`'",tabular data,Titanic Survivors Prediction,,,auc,titanic-survivors-prediction 755,"'`Objective In this homework, you will train a CNN model to classify the CIFAR-10 dataset. The CIFAR-10 dataset will be downloaded from the tensorflow.keras datasets API. You can use any technique to improve the accuracy of your model, like adding more layers to your model, using transfer learning, etc. Of course, a larger model size won't necessarily make accuracy higher. !!! Using the test dataset for training is prohibited !!!`'",image data,108-2 NTUTEE HW5 tensorflow CNN basic,inClass,Inclass competition for ntutee tensorflow class,categorizationaccuracy,108-2-ntutee-hw5-tensorflow-cnn-basic 756,"'` IRIS 2 id: The unique id for each data item. SepalLengthCm:sepal length in cm () SepalWidthCm:sepal width in cm () PetalLengthCm:petal length in cm () PetalWidthCm:petal width in cm () There are two classes of flowers: 1: Iris-setosa 0: Others : 1: Iris-setosa 0: `'",image data,1082IEM DL Assigement 1,inClass,,categorizationaccuracy,1082iem-dl-assigement-1 757,"'`This is a sample evaluation for 15CSE380. Competition ends today (09-05-2020, 06 PM IST). This is a classification problem. The value to be predicted is whether an employee will get a promotion or not. 20 submissions are possible for this evaluation. Logloss is the evaluation criterion.`'",tabular data,15CSE380-NNDL-Eval,,,logloss,15cse380-nndl-eval 758,"'`This is the Internal Board. Practice Competition: please test your model with the internal cut dataset. 80% public/20% private`'",tabular data,[2018 Spring] Internal Board - Practice,inClass,Practice Competition,mcauc,[2018-spring]-internal-board-practice 759,"'`Introduction , ( ) . Academic , . . . https://bit.ly/2UuQvtU Competition background . . , , , . 20 , . . Acknowledgement . . 
( ) Note: No official relationship with Kaggle , , , , , , , , , , , , , , , `'",tabular data,2019 2nd ML month with KaKR,,,rmse,2019-2nd-ml-month-with-kakr 760,"'`How to classify car classes ? , . , . 10 , , . , . , (Class) CCTV , . , (Class) . , !`'",image data,2019 3rd ML month with KaKR,,,meanfscore,2019-3rd-ml-month-with-kakr 761,"'`2019 2020 1 2019 2020 1 . 2019 1 1 01 2019 12 31 24 , 2020 1 1 01 2020 1 31 24 . '' . http://www.airkorea.or.kr/web/last_amb_hour_data?pMENU_NO=123`'",tabular data,SejongAI.. ,,,rmse,sejongai..-- 762,"'` : / 10 feature(, , 5 , , , ) . . Train Data: 2016, 2017, 2018 7-8 , Test Data : 2019 7-8 , [] https://www.data.go.kr/data/3057229/fileData.do [ ] >> https://data.kma.go.kr/stcs/grnd/grndTaList.do?pgmNo=70`'",,SejongAI..[ ],,,rmse,sejongai..[-----] 763,"'`2020 Athens EESTECH Challenge Dataset contains 97 speakers saying 248 different phrases. The 248 utterances map to 31 unique intents, that are divided into three slots: action, object, and location. The goal in preparing this dataset was to provide a benchmark for end-to-end spoken language understanding models. This competition is essentially a ""Speech Command Recognition"" task aiming to Assistive Technology. The underlying technology is very similar to the one used by well known industrial digital assistants (Siri, Cortana, Alexa, Google Assistant, Bixby). LICENSE This work is licensed under the Fluent Speech Commands Public License. Please take a look at the PDF in the data folder. COLLECTION Data was gathered using crowdsourcing. Participants were limited to those located in the United States and Canada. Participants were asked to say each phrase twice. The phrases to record were presented in a random order. Participants were required to consent to their speech data being released along with anonymized demographic information about themselves. The speech data was validated by a separate set of crowdsourcing workers. 
All audio recordings that were deemed by the crowdsourced workers to be noisy, inaudible, unintelligible, or to contain the wrong phrase were removed. SCIENTIFIC DOCUMENTATION Please find the scientific documentation related to the competition here. We will update the file during the competition, so check back. EXTERNAL DATA Please check the rules of the competition regarding external data. You may share your external data sources in the Discussion.`'",tabular data,2020 Athens EESTECH Challenge,inClass,Build an assistive chatbot able to understand speech commands,categorizationaccuracy,2020-athens-eestech-challenge 764,"'` 2020 AI. ( 18011797 ) : () : 2020/06/28 11:59 PM : https://www.data.go.kr/data/15053866/fileData.do`'",tabular data,SejongAI..[ ],,,rmse,sejongai..[--] 765,"'`(, , ) 4 class . train data 2013 1 2018 12 class test data . , . , 4 class . . : https://data.kma.go.kr/data/grnd/selectAsosRltmList.do?pgmNo=36 :https://data.kma.go.kr/data/lwi/lwiRltmList.do?pgmNo=635`'",tabular data,SejongAI..[ ],,,categorizationaccuracy,sejongai..[---] 766,"'` competition . . This is the page for Parrot members (data science group, Sogang University, South Korea , . Acknowledgements .`'",image data,Parrot 2nd computer vision competition,,,categorizationaccuracy,parrot-2nd-computer-vision-competition 767,"'`2019 : California Housing Prices`'",tabular data,4th Bigdata Kaggle EX#1,,,rmse,4th-bigdata-kaggle-ex#1 768,"'`This competition is to predict the label of a digit from a pixel representation. You must use a neural network to complete this task.`'",image data,Digit Labelling Competition,,,categorizationaccuracy,digit-labelling-competition 769,"'`What is this about? A bank needs to predict the outcome of a phone call, to know whether, with the information gathered (or already on hand), the contacted or to-be-contacted client will subscribe to a term deposit. Data Personal and banking data: 1 - id 2 - age: age. Numeric. 3 - job: type of job. Categorical. 
4 - marital: marital status. Categorical. 5 - education: education. Categorical. 6 - default: has credit in default. Categorical. 7 - housing: has a housing loan. Categorical. 8 - loan: has a personal loan. Categorical. Related to the last contact of the current campaign: 9 - contact: contact communication type. Categorical. 10 - month: month of the last contact. Categorical. 11 - dayofweek: day of the week of the last contact. Categorical. Other attributes: 12 - campaign: number of contacts performed during this campaign for this client. Numeric (includes the last contact). 13 - pdays: number of days since the client was last contacted in a previous campaign. Numeric (999 means the client has not been contacted before). 14 - previous: number of contacts performed before this campaign for this client. Numeric. 15 - poutcome: outcome of the previous marketing campaign. Categorical. Socio-economic context: 16 - emp.var.rate: employment variation rate. Quarterly indicator. Numeric. 17 - cons.price.idx: consumer price index. Monthly indicator. Numeric. 18 - cons.conf.idx: consumer confidence index. Monthly indicator. Numeric. 19 - euribor3m: Euribor (3 months). Daily indicator. Numeric. 20 - nr.employed: number of employees. Quarterly indicator. Numeric. Target variable: 21 - y: did the client subscribe to the term deposit? Binary [""yes"",""no""] Note There are many missing values in some categorical variables, all coded as ""unknown"". These missing values can be treated as a class of their own, removed, or handled with some imputation technique.`'",tabular data,Plazo fijo en el banco,,,macrofscore,plazo-fijo-en-el-banco 770,"'`Imagine the following scenario. You are a top-flight machine learning consultant. You have been asked to solve the following problem. COVID-19 (Coronavirus) has become a global pandemic. 
Around the world, scientists and epidemiologists are working day and night with the goal of developing a vaccine and preventing future infections of the disease. Currently, there is a steep shortage of test kits. To effectively allocate resources, one approach is to use x-ray imaging of the chest to narrow down whether or not a patient is infected with COVID-19. This dataset contains several hundred chest x-rays. Some of the images are of a healthy chest, some are of the chest where the subject is infected with COVID-19, and some are of the chest where the subject is infected with pneumonia, SARS, or other ailments. Your task is to create a classifier capable of predicting whether the subject in the x-ray is infected with COVID-19 or is not infected with COVID-19. Scoring The training set and test set contain images of chest x-rays in which the patient either has or does not have COVID-19. The task at hand is a binary classification problem. Your machine-learning algorithm will need to predict whether the x-ray indicates the presence of COVID-19.`'",image data,CPSC 340 Final Part 2,inClass,,f_{beta},cpsc-340-final-part-2 771,"'`Natural Language Processing has recently seen a strong resurgence of interest in the scientific world. This enthusiasm is directly linked to the progress made in recent years in the field of artificial intelligence, and more particularly in one of the branches of AI that tries to give machines the ability to learn autonomously (e.g. machine learning and deep learning). Since every technological advance can be a boon for a company, NLP is particularly so for Edisys. At Edisys we process a large amount of data in order to perform information extraction or document classification, which requires advanced semantic understanding. 
Until now, only human intervention was effective enough to carry out this kind of processing, but today other possibilities exist. Through this R&D workshop, deliberately simple in its objective, we will try to reproduce, in an automated way with modern NLP techniques, the work of an operator assigning one or more topics to a public procurement notice. This will only be the first step in building expertise in this field. The long-term objective is to master the various NLP techniques in order to modernize and optimize the production chain.`'",text data,Natural Language Processing Edisys,inClass,Classification automatique des avis d'appel public à la concurrence,meanfscore,natural-language-processing-edisys 772,'`Please refer to the 'Toxic Comment Classification Challenge' Kaggle competition for details.`',text data,IIITB ML Project: Toxic comment classification,,,mcauc,iiitb-ml-project:-toxic-comment-classification 773,"'`Intro You have some experience with Python and the basics of Data Analysis. This contest is for students who have completed Intro to Data Analysis courses and are looking to improve their practical skills. Competition Description You are looking for a car, and you have access to some data about cars from an auction site. Your task is to build a model that can estimate the price of a listing (e.g. to say if it's reasonable or not). Let's investigate the features and build a regression model to predict the price of the car. 
Practice Skills Feature engineering Regression techniques`'",tabular data,Cars from auction,inClass,,mape,cars-from-auction 774,'` .`',tabular data, ,,,categorizationaccuracy,- 775,"'`AI AcademyTime Series DataRobotDataRobot SubmissioncsvSubmit Predictions 2014-11-01 ~ 2014-12-31(storeid)(prodid)""Sales_qty"" Forum Rules 8/18() 23:59 1Private LeaderboardScore`'",time series,202007 AI Academy Time Series Assignment,inClass,Time series assignment for AI Academy students,mae,202007-ai-academy-time-series-assignment 776,"'`Welcome The Acea Group is one of the leading Italian multiutility operators. Listed on the Italian Stock Exchange since 1999, the company manages and develops water and electricity networks and environmental services. Acea is the foremost Italian operator in the water services sector supplying 9 million inhabitants in Lazio, Tuscany, Umbria, Molise, Campania. In this competition we will focus only on the water sector to help Acea Group preserve precious waterbodies. As it is easy to imagine, a water supply company struggles with the need to forecast the water level in a waterbody (water spring, lake, river, or aquifer) to handle daily consumption. During fall and winter waterbodies are refilled, but during spring and summer they start to drain. To help preserve the health of these waterbodies it is important to predict the most efficient water availability, in terms of level and water flow for each day of the year. Data The reality is that each waterbody has such unique characteristics that their attributes are not linked to each other. This analytics competition uses datasets that are completely independent from each other. However, it is critical to understand total availability in order to preserve water across the country. Each dataset represents a different kind of waterbody. As each waterbody is different from the other, the related features are also different. 
So, if for instance we consider a water spring we notice that its features are different from those of a lake. These variances are expected based upon the unique behavior and characteristics of each waterbody. The Acea Group deals with four different types of waterbodies: water springs, lakes, rivers and aquifers. Challenge Can you build a story to predict the amount of water in each unique waterbody? The challenge is to determine how features influence the water availability of each presented waterbody. To be more straightforward: by gaining a better understanding of volumes, they will be able to ensure water availability for each time interval of the year. The time interval is defined as day/month depending on the available measures for each waterbody. Models should capture volumes for each waterbody (for instance, for a model working on a monthly interval a forecast over the month is expected). The desired outcome is a notebook that can generate four mathematical models, one for each category of waterbody (aquifers, water springs, river, lake), that might be applicable to each single waterbody. See the Submission Evaluation criteria.`'",tabular data,Acea Smart Water Analytics,,,water bodies,acea-smart-water-analytics 777,"'`Introduction Dear students, welcome to the first stage of the first assignment of PMR3508. This assignment is divided into two stages. The first consists of participating in this closed competition, in a safe environment, so that you become familiar with Kaggle, with Python, and with their tools. The second part consists of using the data from the [Costa Rican Household Poverty Level Prediction](https://www.kaggle.com/c/costa-rican-household-poverty-prediction) competition and building a classifier, reporting the results in a notebook. Procedure The procedure for carrying out this assignment is quite simple. 
Each student is expected to produce a Jupyter notebook containing at least two sections: one on data exploration, observing how the dataset variables are distributed, which values they take, etc.; the other dedicated to evaluating the impact of variable selection, feature engineering, and the choice of the parameter K of the K-Nearest Neighbors (KNN) algorithm on the accuracy of your classifier. Submitting your predictions for the test set is optional, but encouraged. ATTENTION: the grade for this exercise will be based on the SUBMITTED KERNEL, even if the student submits a result to the leaderboard without providing the kernel. Instructions In the ""Data"" tab you will find the files containing the data and their respective descriptions. After finishing your work, save your notebook (in .ipynb format) and click on the ""Kernels"" tab. There, click ""new kernel"" and follow the instructions to submit your kernel to the competition. Then proceed in the same way for the second part of this exercise. Special thanks to the UCI Repository for releasing this dataset into the public domain. Tips 1) The pandas documentation is very useful! 2) The pd.describe() command is very useful to get a sense of your data. 3) The ScikitLearn library has several interesting tutorials on its site ( http://scikit-learn.org/stable/index.html ). 4) Remember to cross-validate your results!!!!`'",tabular data,Adult-PMR3508,,,categorizationaccuracy,adult-pmr3508 778,"'`In this competition, you are asked to predict the rating of a book review from the Amazon Kindle store. Evaluation metric: Mean Absolute Error. Deadline: 22nd February, 23:59 (UTC). If you have any questions, you are free to ask them in the Telegram chat.`'",text data,AI Community Innopolis #3,inClass,Contest 3: Kindle Reviews,mae,ai-community-innopolis-#3 779,"'` : 2018/10/13 9:30~2018/10/19 23:59 PM2.5 LASS, AirBox()2017/1/1~2017/1/30. 
252(device)PM25_train.csv, : device_id : ID Date : Time : PM2.5 : PM2.5 PM10 : PM10 PM1 : PM1 Temperature : Humidity : lat : lon : ---------------------------------------------------------------------------------------------------- hub server data path: /data/examples/pm25 ---------------------------------------------------------------------------------------------------- : 2017/1/1~1/30,2522017/1/31PM2.5(by device ID). MSE(mean-square error), . Leaderboard baseline MSE 1.AirBox https://airbox.edimaxcloud.com/ 2.LASS http://lass-net.org/ 3.-g0v https://airmap.g0v.asper.tw/ 4. https://taqm.epa.gov.tw/taqm/tw/AqiForecast.aspx https://taqm.epa.gov.tw/taqm/tw/YearlyDataDownload.aspx 5. (:PM2.5) https://data.gov.tw/ 6.- 887 (2016-12-26) https://www.youtube.com/watch?v=qC8117PUajw`'",tabular data,AIA mid-term exam -PM2.5 forecast,inClass,台灣人工智慧學校 台中技術領袖班第一屆 期中考試-PM2.5預測,mae,aia-mid-term-exam--pm2.5-forecast 780,"'` @article{ahmed2016house, title={House price estimation from visual and textual features}, author={Ahmed, Eman and Moustafa, Mohamed}, journal={arXiv preprint arXiv:1609.08399}, year={2016} }`'",tabular data_image data,2020 3rd DataRobot AI Academy Deep Learning,,,mape,2020-3rd-datarobot-ai-academy-deep-learning 782,"'`Description Contest: AIF Challenge 1 Field: Image Processing Problem: Traffic Sign Classification Prize: $400,000 Prepared by: Dam Ba Quyen`'",image data,AIF Challenge 1 - Traffic Sign Classification,inClass,AI Forces Challenge,categorizationaccuracy,aif-challenge-1-traffic-sign-classification 783,"'`AILAB ML Training #2 Training Objectives: Learning how to handle table data Familiarizing yourself with pandas, lightgbm, neural networks, etc. What is Mercari Price Suggestion Challenge ? This is a product price prediction competition held by Mercari inc. in the past. 
(link)`'",tabular data,AILAB ML Training #2,,,rmsle,ailab-ml-training-#2 784,"'`If you joined this competition it means that you chose to work with the wine dataset, instead of the student dataset. Your goal now will be to take the training data and use it to train a Linear Regressor and a KNN Regressor. After training, you need to make predictions and format your results similarly to what is in the example solution file, and submit them. The final submission you make will be counted as your submission for this portion of the assignment. Test and training data can be read in using the code below. Similar to your in-class assignments, make sure the data files are in the same folder as your notebook file.

import csv
import pandas as pd

data_train_df = pd.read_csv(""TrainingFileName.csv"")
data_train_ft = data_train_df.drop('Class', axis=1)
data_train_tgt = data_train_df[""Class""]

To format your solution correctly you can use the function below. It will take your prediction data and write it to a CSV file, which you can then submit as your solution.

# predictions should be the result returned by modelName.predict(test_features)
def writeSubmission(predictions):
    i = 1
    submissionList = []
    for prediction in predictions:
        submissionList.append([str(i), str(prediction)])
        i += 1
    with open('submission.csv', 'w', newline='') as submission:
        writer = csv.writer(submission)
        writer.writerow(['Id', 'Predicted'])
        for row in submissionList:
            writer.writerow(row)
`'",,AIML - Wine Quality Dataset,,,rmse,aiml-wine-quality-dataset 785,"'`If you joined this competition it means that you chose to work with the Zoo dataset, instead of the Adult dataset. Your goal now will be to take the training data and use it to train KNN and Naive Bayes classifiers. After training, you need to make predictions and then format your results similarly to what is in the example solution file, and submit them. The final submission you make will be counted as your submission for this portion of the assignment. 
Test and training data can be read in using the code below. Similar to your in-class assignments, make sure the data files are in the same folder as your notebook file.

import csv
import pandas as pd

data_train_df = pd.read_csv(""TrainingFileName.csv"")
data_train_ft = data_train_df.drop('Class', axis=1)
data_train_tgt = data_train_df[""Class""]

To format your solution correctly you can use the function below. It will take your prediction data and write it to a CSV file, which you can then submit as your solution.

# predictions should be the result returned by modelName.predict(test_features)
def writeSubmission(predictions):
    i = 1
    submissionList = []
    for prediction in predictions:
        submissionList.append([str(i), str(prediction)])
        i += 1
    with open('submission.csv', 'w', newline='') as submission:
        writer = csv.writer(submission)
        writer.writerow(['Id', 'Category'])
        for row in submissionList:
            writer.writerow(row)
`'",,AIML - Zoo Dataset,,,categorizationaccuracy,aiml-zoo-dataset 786,"'`The objective of this competition is to classify images from Airy properties. There are 10 main categories in this dataset. The labels are not designed to be balanced. Thus, it's up to you how to handle that problem. Beware, you might find some duplicated images as well. As a first step, try building a classifier that uses the simplest transfer learning method. Next, try exploring more advanced and intricate models as well. You might also find data augmentation to be useful in this case. Finally, examine the errors you're making and see what you can do to improve. You are allowed to form a team. However, the prize amount will be the same. So, you need to find a way to divide the prize if your team wins!`'",image data,Airy Photo Classification Challenge,inClass,Can you tell which part of a hotel are these images?,meanfscore,airy-photo-classification-challenge 787,"'`Submissions are evaluated on area under the ROC curve between the predicted probability and the observed label. 
Submission File For each TZ in the test set, you must predict a probability for the NESHER column. The file should contain a header and have the following format: TZ ,NESHER b'\xcf\x14t\x12\xafK\x11\xf9\x19 b'U@\x06z\x19%\xb1\x98\x9d\xba~\xdf,?(=',0.5 b'7\xfd\x89\xc4\xdbsl&\x08\xa7\xde\xcd\x95\xeeZ\x17',0.5 etc.`'",,"""",,,auc,"""" 788,"'`Welcome to HACK the WAVE: Leveraging Data Science to help fight ALS. TWEET: #hackthewave Join our slack channel www.hackthewaveals.slack.com Problem Amyotrophic Lateral Sclerosis (ALS), also known as Lou Gehrig's disease, is a progressive neurodegenerative disease that affects nerve cells in the brain and the spinal cord. Once diagnosed, the average life expectancy is 2-5 years, in which time you lose the ability to walk, talk, eat and breathe. Currently, there is no cure and there is only one FDA-approved drug, which modestly extends survival. ALS is a disease that knows no barriers: it can affect anyone, of any age, ethnicity, socioeconomic background or gender. The only known link to ALS is military service: service members or veterans are more than twice as likely as the general population to develop the disease. ALS Association Golden West Chapter The ALS Association Golden West Chapter is dedicated to the fight against ALS in many ways, including funding global research efforts, supporting scientific and clinical collaboration, connecting people with ALS to clinical trials, partnering with multidisciplinary ALS clinics and centers, educating the public about ALS, providing professional care management services to families facing ALS, pursuing important public policy initiatives, and bringing the ALS community together. Opportunity Booz Allen Hamilton's Honolulu Office and Women in Data Science (WiDS) strive to empower Booz Allen community members (fellow colleagues, family members, friends) who have been affected by ALS. 
The aim of this project is to increase awareness of the disease while exploring how to further the research and advocacy efforts led by the ALS Association. Where, When, and Who Date, Time & Location: Friday, September 20 to Sunday, September 22 at the Punahou School in Honolulu, and the Entrepreneurs Sandbox Format: Opening reception, keynote speakers, two days of competition, closing ceremony This FREE 3-day event includes people like YOU who are interested in creating unique design-thinking solutions to generate new ideas to battle ALS. This 3-day event is FREE and open to the community, from professionals to students to amateur data scientists and even those who have been affected by ALS. Everyone is welcome regardless of coding experience or knowledge of ALS. Punahou School faculty will be providing instruction and support on design, prototyping and coding. Novices will learn and apply basic coding and design-thinking skills. Hack-a-thon Details This is the description page for the HACKtheWAVE, brought to you by Booz Allen Hamilton and Punahou School! Join us and other data science and design-thinking enthusiasts in tackling one of four different challenges: Natural Language Processing Track: Bibliometrics -- analyzing published research results to inform research funding decisions. https://www.kaggle.com/c/alsa/overview/track-1-natural-language-processing-of-scientific-lit Data Visualization Track: Creating data-powered visual imagery to communicate messages about ALS clinical trials. https://www.kaggle.com/c/alsa/overview/track-2-visualization Bioinformatics Track: Predict the progression of the disease. Bioinformatics Challenge Data Sets: this is ALS genomic data from ALS patients, also known as the PRO-ACT data set. 
https://www.kaggle.com/c/alsa/overview/track-3-bioinformatics App Design Track (Saturday AM only): Clinical Trials to Defeat ALS Design an app and site for the public to track the enrollment, inclusion/exclusion criteria, and phases of active clinical trials, and provide visualizations to get people excited about tracking the journey the way you would with a walk-a-thon team. https://www.kaggle.com/c/alsa/overview/track-4-app-design`'",,HacktheWAVE ALS,inClass,Use the power of data to assist the ALS Association,rmse,hackthewave-als 789,'`Let's try to predict the price of a bottle of wine based on a collection of over one hundred thousand reviews and other product features.`',text data,AMMI Ghana Bootcamp Kaggle competition,,Predict the price of a bottle of wine based on a collection of over one hundred thousand reviews and other product features.,rmse,ammi-ghana-bootcamp-kaggle-competition 790,"'`The travelling salesman problem (TSP), also known as the travelling salesperson or travelling agent problem, answers the following question: given a list of cities and the distances between each pair of them, what is the shortest possible route that visits each city exactly once and, at the end, returns to the origin city? This is an NP-hard problem within the field of combinatorial optimization, very important in operations research and computer science. The objective of this competition is, given 76 data files, to obtain the optimal values for each of them. Each data file represents a different TSP problem, with different cities to visit. The evaluation method is the mean squared error with respect to the optimal value. The problem was first formulated in 1930 and is one of the most studied optimization problems. It is used as a benchmark for many optimization methods. 
Although the problem is computationally complex, a large number of heuristics and exact methods are known, so that some instances with from a hundred up to thousands of cities can be solved. The TSP has several applications even in its simplest formulation, such as planning, logistics, and the manufacture of electronic circuits. Slightly modified, it appears as a sub-problem in many areas, such as DNA sequencing. In these applications, the concept of a city represents, for example, customers, soldering points, or DNA fragments, and the concept of distance represents travel time or cost, or a similarity measure between DNA fragments. In many applications, additional constraints such as resource limits or time windows make the problem considerably harder. The TSP is a special case of the travelling purchaser problem. In computational complexity theory, the decision version of the TSP (where, given a length L, the task is to decide whether the graph has a tour shorter than L) belongs to the class of NP-complete problems. Therefore, it is likely that in the worst case the running time of any algorithm that solves the TSP grows exponentially with the number of cities.`'",tabular data,TSP Algoritmos y Programacin 2020,,,rmse,tsp-algoritmos-y-programacin-2020 791,"'`CSE5311 Sound Classification Homework 5 wav Classification csv Data set wav . Channels : 1 Sample Rate : 44100 Precision : 16-bit Duration : 00:00:05.00 = 220500 samples Sample Encoding: 16-bit Signed Integer PCM`'",audio, HW,,,categorizationaccuracy,-hw 792,"'` , , 1 : 3 2-3 : `'",,Hello World,,,rmsle,hello-world 793,"'`Steps to Complete This Assignment Each member in your team will create a new Kaggle account using his/her terpmail.umd.edu email address (if you don't already have one). 
One member from your team will go to the Team tab on this page and Save Team Name as your team name on ELMS, e.g. T0-ai girls; other members will have to join the competition and Save Team Name as their terpmail.umd.edu email address. In the same Team tab, under the section Invite Others Merge with other teams or invite users to your team by their team name, add the Kaggle accounts of the other members (using their terpmail.umd.edu addresses) to your team; the other members will have to accept the merge request. Select the Notebooks tab, create a new notebook and set the notebook name as your team name on ELMS, e.g. T0-ai girls. In the notebook, select the Share button in the top right corner, set the privacy setting to Public, and then add each team member as Collaborators. Use the notebook to run the code for prediction, generate the prediction file, and visualize the predictions. In the notebook, select Save Version in the top right corner, and select Save & Run All (Commit); this will be the version of your final submission. In the notebook, select the version number next to the Save Version button, select Go to Viewer, scroll down to the Output Files section, and select Submit to Competition. If done correctly, your submission should appear on the Leaderboard. How Your Grade Will Be Determined Many of the prompts have no wrong answer. You will be given full credit based on effort. Submit Kaggle notebook with visualization: 4 points Submit prediction file from Kaggle notebook: 1 point Submissions You must make submissions directly from your team's public Kaggle notebook. If done correctly, you should see your notebook and your submission score publicly by going to the Notebooks tab. Visualization Format Using your predicted id and bounding box labels, display the bounding boxes with the person's name on the images provided. Visualization example: Submission Format For each image in the Data, you must predict the vectors of (id, xmin, xmax, ymin, ymax). 
Each field is described as follows: id: string type, concatenation of the image_id and person_id, e.g. 1.jpg & Raymond Tu -> 1_25 image_id values are based on the filename without the file extension person_id values are based on the person_id_mapping.csv provided xmin: float type, left pixel value of the bounding box xmax: float type, right pixel value of the bounding box ymin: float type, top pixel value of the bounding box ymax: float type, bottom pixel value of the bounding box All bounding box values are based on pixel coordinates of the provided images. All predictions must be submitted as a file in CSV format. An example of the submission file is provided in sample_submission.csv. Evaluation In this competition, submissions are evaluated by computing the Root Mean Squared Error (RMSE) of all matches. A match is defined as: the prediction bounding box has an id label that exists in the ground truth. The final RMSE is computed as the normalized distance between the vectors of predicted bounding box values and the vectors of ground truth bounding box values, i.e. RMSE(xmin, xmax, ymin, ymax) for all matches. A lower RMSE value means better prediction. Your team will be ranked on this final metric. Additional Information Late submissions will not be accepted for this assignment. You are responsible for creating a valid submission. Zero credit will be given for this assignment if your notebook or submission file cannot be opened on Kaggle.`'",image data,ASN10e Final Submission - Detect COML Faces,inClass,,rmse,asn10e-final-submission-detect-coml-faces 794,"'`ImageNet is a well known dataset with 1000 image classes. We will be working on a subset of the dataset (60k images, 100 classes, 600 images per class, 80x80 pixels, RGB) and train a model to classify an image into one of the 100 classes. The dataset is located under the data directory in the assignment zip file or available to download here. 
Training and validation data splits are under the data/train and data/val directories respectively. Both splits consist of 100 directories, each representing an object category. The test set doesn't contain labels. You should submit a csv file with your label predictions. Use the filename (without the extension) of the test image as an image id. Example: id,label 1,29 2,40 3,99 ... 29093,4 UPDATE: Added a notebook with an example of how to convert predictions to a csv file in the Notebooks tab`'",image data,ATML2020 Assignment 2,,,categorizationaccuracy,atml2020-assignment-2 795,"'`Welcome to a competition that is not a competition. We want to give the teams involved in the Automodele project an environment that allows them to use analytical tools and the shared datasets, while also offering a forum for discussion and for reviewing the solutions developed by other people from the Data Science community. The first step in our project is to run basic models on a basic dataset, namely the Iris dataset: this is precisely the task set before the participants of this ""competition"".`'",tabular data,Automodele,,,categorizationaccuracy,automodele 796,"'`Project Overview This competition is the final project for BADM 211 at the Gies College of Business, University of Illinois at Urbana-Champaign. The data for this competition and prize money has been generously provided by IRI Worldwide, a Chicago-based market research company. Every year, trillions of gigabytes of data are produced by normal business operations, but most of that data is not used in any meaningful way. However, given what you have learned in BADM 211 over the course of this semester, you now have the skills to change that. The dataset provided here is an example of the data that we generate in our daily lives: the grocery spending habits of people in different parts of the United States. 
A sample of households from Eau Claire, Wisconsin, and Pittsfield, Massachusetts, participated in a data collection effort by IRI. In this competition, you will demonstrate that such data are not only of interest to each store's accounting department but to a broader range of business decision-makers. Prediction Task Your client is a grocery store that needs help deciding whether to add salty snacks to its product offerings. They will make this decision based on how much annual revenue salty snacks will be expected to generate. Your task is to assist the store in making this decision by predicting the revenue from salty snacks, using data on consumer spending habits from a previous year. You will build and train a predictive model using data on consumer spending from 2010. The data will be split into training and test datasets. You will predict spending on salty snacks for a panel of consumers representative of the store's market. The store will then use this to estimate the total revenue for the market and, ultimately, decide whether to add this category. Data Overview Luckily for you, there is a wide range of data for you to work with, including demographic and sales data, which you can use to construct your predictive model and help the store make a data-driven business decision. You have some Demographic and Household variables such as Race, Education, Renter vs Owner, Family Size, Number of Cats and their annual spending across various categories, e.g. yogurt, household cleaning, salty snacks and more (31 categories total). A complete overview of the data can be found on the Data page. Similar Challenges There are similar competitions on Kaggle such as: Rossmann Store Sales link. House Price Prediction link You can use public notebooks to get ideas about EDA as well as different types of algorithms different people use for similar problems. 
You are free to recycle code from those public resources with proper citation.`'",tabular data,BADM 211 Final Project,,,rmse,badm-211-final-project 797,"'` : - , - . , . , (1 ) (0 ).`'",tabular data,Bank issues competition,inClass,You will need to predict if clients will be loyal to the bank,meanfscore,bank-issues-competition 798,"'`Welcome to the Bluecap Fraud (Updated) competition. A scoring model for individuals must be built on the variable ""FRAUDE"", which takes values 0 (no fraud) or 1 (fraud, defined as non-payment of one of the first 3 instalments from the data observation point). Each column represents a characteristic of the person and has to be cleaned and treated (there are missing values). The output will be a fraud probability (it does not need to be calibrated, only the sorting matters).`'",tabular data,BC Fraud (Updated),inClass,Create a Fraud Score,auc,bc-fraud-(updated) 799,"'`Classify news articles to automate an email newsletter Build a model capable of performing a fine selection of news articles with the goal of generating a customized newsletter every day. Thousands of online articles have been downloaded over the last two years, and those included in previous newsletters have been labeled. For a new set of articles, the model should be trained to accurately decide which articles should be included in the newsletter. The train dataset should be used to fit your classifier and predictions should be made on the test dataset. Your predictions are submitted and Kaggle automatically calculates your score based on the AUC criterion.`'",text data,bcpnews,inClass,Classify news articles,auc,bcpnews 800,"'`Objective The objective of this competition is to put into practice some of the concepts covered in the ""Modelización estadística"" (statistical modelling) course. 
How the competition works To take part in the competition, follow these steps: Accept the participation rules (blue button on the right, below the box with the competition title). Check the examples that will be posted; to do so, go to the Notebooks section and look at the public notebooks posted there. The results of these notebooks will be used as benchmarks on the leaderboard. Train a model and make predictions; for this, you can build a model locally (installing Python, the necessary packages, etc.) or create a private notebook (recommended, as it will be easier to share results and install the necessary packages). Upload the model results as a submission; to do so, click the submit predictions button on the right side of the screen, below the box with the competition title. For any questions, you can post in the Discussion section (preferred, so that other classmates can also see the answers) or contact me directly by mail / chat.`'",tabular data,BCU Ratings 2020,,,auc,bcu-ratings-2020 801,"'`This dataset has been collected during an experiment on a packaging production line. The chain of the production line was monitored in normal mode during the first part of the dataset. Then the chain was stretched while production continued to be monitored. The data were aggregated at the minute level using the median. The column flag is used to distinguish the two periods. It equals 1 when the chain of the production line is tensed and 0 when the chain is loose. The aim of this experiment was to detect the anomalies resulting from the stretching of the chain. 
Columns Description timeindex - Number of seconds since the beginning of the experiment flag - Indicates the normal period (1) and the suspected anomalous period (0) currentBack - Current of the rear motor motorTempBack - Temperature of the rear motor positionBack - Position of the rear chain refPositionBack - Reference position of the rear chain refVelocityBack - Reference velocity of the rear chain trackingDeviationBack - Tracking deviation of the rear chain velocityBack - Velocity of the rear chain currentFront - Current of the front motor motorTempFront - Temperature of the front motor positionFront - Position of the front chain refPositionFront - Reference position of the front chain refVelocityFront - Reference velocity of the front chain trackingDeviationFront - Tracking deviation of the front chain velocityFront - Velocity of the front chain Acknowledgements Schneider Electric Exchange for providing this dataset.`'",tabular data,BDA 2019 ML Test -Packaging anomaly detection,,,f_{beta},bda-2019-ml-test--packaging-anomaly-detection 802,'`Insert text here. Please use markdown.`',text data,bh8tL99mh7T8dvj,inClass,[200130] Pingpong AI Research - 박채훈님,categorizationaccuracy,bh8tl99mh7t8dvj 803,"'`Mini Data Competition The scoring metric is RMSE No outside data allowed`'",tabular data,Big Vehicle Peasants,,,rmse,big-vehicle-peasants 804,'` .`',tabular data,Sadang Pancake House,inClass,Ohoh it is mat jip,rmse,sadang-pancake-house 805,"'`Competition Introduction , . . . competition `'",tabular data, UI Competition 1th,,,auc,-ui--competition-1th 806,'` ?`',image data,BigData Team | ?,,,auc,bigdata-team-|----? 807,"'`Competition Introduction , . . . competition `'",tabular data,BigData UI Class Competition 1th,inClass,,auc,bigdata-ui-class-competition-1th 808,"'`Competition Description The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. 
On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy. Practice Skills Binary classification Python and R basics`'",tabular data,BI Insights Team - Titanic Challenge,inClass,Titanic Challenge ,categorizationaccuracy,bi-insights-team-titanic-challenge 809,"'`BikeSharingDSG test BikeDemand2011120126train, 20127~12testtest50%public,private 2020/8/7() 13:00 () submit 120()2`'",tabular data,Bike Sharing Demand for Education,,,rmsle,bike-sharing-demand-for-education 810,"'`Mission Statement You have been given seven smart agents, each doing a specific task. While the task details are confidential, you have been given 94 parameters describing the state of each agent. These parameters are associated with the label which corresponds to the agent's performance on the confidential task. Observations have been taken across time and each observation corresponds to an entry in the train.csv file. Your mission (should you choose to accept it) is to develop a model that is able to predict the agent's performance (i.e. label) given the state parameters, agent id and the time of observation. About the Competition The competition will last for a week. You will be given 6 days to develop a model using only the training data. 
On the seventh day, the test set will be released, and the leaderboards will be opened. You will thus have 24 hours to validate your model that was built over the past six days. After that, you will have 12 hours to submit the code for your best performing model as a Kaggle kernel. Make sure you submit a minimal working code sample, that is - include only code directly related to the submission. Leave out data visualization. Also, make sure that your kernel works on Kaggle with the output for each cell shown clearly and generates an output .csv file for evaluation. You will have a cap of five submissions per day. You are allowed to choose two submissions for the final evaluation, out of which the best performing will be considered. These two submissions must be chosen before the competition deadline. This competition is Evaluative. About using multiple accounts Kaggle has methods of tracking multiple related accounts submitting to a single contest. In case your account(s) get blocked, you're on your own. About allowed methods Refer to the competition rules. Updates [ 23rd March - 06:00 PM ] While the description of an agent and the notion of optimization may hint towards Reinforcement Learning, the problem was originally designed as a regression problem. Make sure you try the simple approaches first. [ 25th March - 03:30 PM ] Some clarifications about the test set: The test dataset would be a shifted version of the train set (without the labels obviously). Thus if the train set corresponds to the first eleven months, then the test set would correspond to the twelfth month. All entries would be present for each agent sorted according to the time feature. [28th March - 12:30 PM] You can assume that the time values are equally spaced. [28th March - 11:59 PM] There have been some changes in the timeline. 
Check the Timeline section for more details.`'",tabular data,BITS-F464-L1 : Predict Agent Performance,,,rmse,bits-f464-l1-:-predict-agent-performance 811,"'`This is the memorable v. not memorable binary classification competition for the G6061 Undergraduate and 934G5 Postgraduate Machine Learning Module, spring teaching 2019/2020, at the University of Sussex, UK. Please make an account with your University of Sussex email ID. You are provided with 247 labelled training data points (209 memorable scenes and 38 not memorable scenes) and 11,874 test data points, which are not labelled. The task is to develop a binary-class classifier that predicts the labels for the test data set. Each data instance is represented as a 4608 dimensional feature vector. This vector is a concatenation of 4096 dimensional deep Convolutional Neural Networks (CNNs) features extracted from the fc7 activation layer of CaffeNet and 512 dimensional GIST features (this representation is given, therefore you do not need to perform any feature extraction on images). Our training data were collected in Brighton, while test data were collected in London. We have a domain adaptation problem! We also know that there are many more memorable sceneries in Brighton than in London. Additionally, you are also provided with three types of information that might be useful when building your classifier: a) additional 2219 labelled training data which is incomplete as it has missing feature values, b) confidence of the label annotation for each training data point (247 labelled training data and additional but incomplete 2219 labelled training data), and c) the proportion of positive (memorable) data points and the proportion of not memorable data points in the test set. You can choose to incorporate or to ignore these additional data. You can use any of your favourite classifiers. 
In this module, we have discussed: perceptron, multi-layer perceptron, random forest (G6061), RBF networks (934G5 PG), support vector machine, and logistic regression. There are 2 leaderboards - one public that is 25% of the test data and one private that is the other 75%. Public-private splits are done randomly. The public leaderboard will be used to evaluate your current score and ranking until the deadline date. The final rankings will be based on the private leaderboard - so make sure you do not overfit your model to the public leaderboard. Keep in mind that you can upload multiple prediction files; your only limit is that at most 5 prediction files can be uploaded per day. You can select up to 2 final submissions for judging. You have to make at least one submission to this competition! The format of the solution file you submit should be the same as the file samplevalidsubmission.csv (i.e. 2 columns with 1st column as ID & 2nd column as prediction). With each submission, please write a brief description of the model (e.g. logistic regression with the regularisation parameter=10). Timing: - This competition will be closed on Friday 30 June 2020 11:59PM. - Check our e-submission system for the deadline of the report.`'",image data,"Brighton, a memorable city!",,,categorizationaccuracy,"brighton,-a-memorable-city!" 812,"'` . . 25,000 . , . , , , . , , 18 . : -5 (50). 6 , ( LB), 49 1 ; . , , , 1 . 0. , ( Notebooks / Discussion). , Notebooks ( 30 - LB, ). GridSearch , : ( , , , ) (EDA, , , ) , , . , . ( - torchvision ). , , - ? ! Evaluation metric: Acknowledgements We thank Daniel Lysukhin for providing this dataset.`'",image data,Car plates OCR,,,levenshteinmean,car-plates-ocr 813,"'` . - , . . 
: id - city - price - ( ) brand - model - drive_type - engine_summary - owner_type - generation - year - color - body - gear_type - wheel_type - state - is_new - doors_count - mileage - ownersbypts - MAPE`'",tabular data,Car Price Modelling,,,mape,car-price-modelling 814,"'`Problem Statement: Given the Product Details, Customer Details and the Transaction Details spanning over 7 quarters, use these to predict what the customer will buy next. The dataset contains transaction details of 10K customers. Therefore, the prediction file should contain product recommendations for these 10K customers. Churn Prediction Hackathon`'",tabular data,Cartesian Super Data Bowl 2019,inClass,Build a Recommender System! ,map@{k},cartesian-super-data-bowl-2019 815,"'`This competition is for the corona-related online part of the Machine Learning Lab within the CAS Data Science at FHNW in the spring semester 2020. Task We have provided a dataset here containing information on 22'570 properties in Switzerland. The goal is to estimate the price for the 2257 properties in the test set. Possible approach Read in the data Split the data into a training and a development set Clean the data Analyse the data Correlation matrix Coordinates Identify outliers Train a simple model to estimate the price (e.g. linear regression) Evaluate the model on the development set Estimate the prices on the test set and upload a submission Train and evaluate further models Gradient boosting Train different models for the different object types Train an ensemble of several models Feature engineering Transform features Combine features Look for new features (e.g. 
the municipal tax rate, geo data) Optimize hyperparameters`'",tabular data,Machine Learning Lab - CAS Data Science FS 20,inClass,FHNW Edition,mape,machine-learning-lab-cas-data-science-fs-20 816,"'`Web-user identification is a hot research topic on the brink of sequential pattern mining and behavioral psychology. Here we try to identify a user on the Internet by tracking his/her sequence of attended Web pages. The algorithm to be built will take a webpage session (a sequence of webpages attended consecutively by the same person) and predict whether it belongs to Alice or somebody else. The data comes from Blaise Pascal University proxy servers. Paper ""A Tool for Classification of Sequential Data"" by Giacomo Kahn, Yannick Loiseau and Olivier Raynaud.`'",tabular data,"Catch Me If You Can (""Alice"")",,,auc,"catch-me-if-you-can-(""alice"")" 817,"'`Rules Update: The CDiscount team has updated their rules to allow for use of this dataset for research and academic purposes only. To access the data, go to rules and accept the terms to download the data. Cdiscount.com generated nearly 3 billion euros last year, making it France's largest non-food e-commerce company. While the company already sells everything from TVs to trampolines, the list of products is still rapidly growing. By the end of this year, Cdiscount.com will have over 30 million products up for sale. This is up from 10 million products only 2 years ago. Ensuring that so many products are well classified is a challenging task. Currently, Cdiscount.com applies machine learning algorithms to the text description of the products in order to automatically predict their category. 
As these methods now seem close to their maximum potential, Cdiscount.com believes that the next quantitative improvement will be driven by the application of data science techniques to images. In this challenge you will be building a model that automatically classifies the products based on their images. As a quick tour of Cdiscount.com's website can confirm, one product can have one or several images. The data set Cdiscount.com is making available is unique and characterized by superlative numbers in several ways: Almost 9 million products: half of the current catalogue More than 15 million images at 180x180 resolution More than 5000 categories: yes, this is quite an extreme multi-class classification!`'",image data,Cdiscounts Image Classification Challenge,,,categorizationaccuracy,cdiscounts-image-classification-challenge 818,"'`The goal of the competition is to predict cell phone ratings based on their reviews. A cell phone can be rated from 1 to 5. The student's task is to create an algorithm that predicts one of these values based mainly on raw text data. The submissions will be evaluated using Root Mean Squared Error (RMSE). To create a predictive model, students should use Python and scikit-learn. In addition to submitting their solution on this site, students are required to provide a link to reproducible code in the form of a Jupyter Notebook. Project and submission deadline: 19.01.2019`'",text data,Cell Me,inClass,Predict the ratings of mobile phones,rmse,cell-me 819,"'`Competition description The goal of this competition is the prediction of the price of diamonds based on their characteristics (weight, color, quality of cut, etc.). This is an academic competition created for the students of CEUPE Big Data & Analytics courses.`'",tabular data,CEUPE Big Data & Analytics,,,rmse,ceupe-big-data-&-analytics 820,'` .`',tabular data, : ,,,auc,-:-- 821,"'`Welcome Welcome to your personal Data Analytics Challenge. Good luck and, above all, have fun! 
Task You take on the role of a data analyst at MedBank, a cooperative bank specializing in the healthcare market. You have just received your new project and the associated dataset, which is unknown to you. Your task is to build the best possible prediction model for a marketing campaign for a new product, ""MedTrust"", and to predict the purchase probabilities. To this end, analyse the training dataset. First, carry out an exploratory analysis: for which of the explanatory variables can relationships with the target variable be observed? Which of them make sense from a domain perspective? Select a set of explanatory variables and then build various prediction models (logistic regression, decision trees, and random forests). For each method, build several models with different parameters. Proceed as systematically as possible, so that you can later describe the model's performance (= AUC) as a function of the parameters. Apply this model to make a prediction on the test data.`'",tabular data,Challenge GH,inClass,Challenge GH,auc,challenge-gh 822,"'`Welcome Welcome to your personal Data Analytics Challenge. Good luck and, above all, have fun! Task You take on the role of a data analyst at MedBank, a cooperative bank specializing in the healthcare market. You have just received your new project and the associated dataset, which is unknown to you. Your task is to build the best possible prediction model for a marketing campaign for a new product, ""MedTrust"", and to predict the purchase probabilities. To this end, analyse the training dataset. First, carry out an exploratory analysis: for which of the explanatory variables can relationships with the target variable be observed? Which of them make sense from a domain perspective? 
Select a set of explanatory variables and then build various prediction models (logistic regression, decision trees, and random forests). For each method, build several models with different parameters. Proceed as systematically as possible, so that you can later describe the model's performance (= AUC) as a function of the parameters. Apply this model to make a prediction on the test data.`'",tabular data,Challenge SM,inClass,Challenge SM,auc,challenge-sm 823,"'`Welcome Welcome to your personal Data Analytics Challenge. Good luck and, above all, have fun! Task You take on the role of a data analyst at MedBank, a cooperative bank specializing in the healthcare market. You have just received your new project and the associated dataset, which is unknown to you. Your task is to build the best possible prediction model for a marketing campaign for a new product, ""MedTrust"", and to predict the purchase probabilities. To this end, analyse the training dataset. First, carry out an exploratory analysis: for which of the explanatory variables can relationships with the target variable be observed? Which of them make sense from a domain perspective? Select a set of explanatory variables and then build various prediction models (logistic regression, decision trees, and random forests). For each method, build several models with different parameters. Proceed as systematically as possible, so that you can later describe the model's performance (= AUC) as a function of the parameters. Apply this model to make a prediction on the test data.`'",tabular data,Challenge ARO,inClass,Challenge ARO,auc,challenge-aro 824,"'`Welcome Welcome to your personal Data Analytics Challenge. Good luck and, above all, have fun! 
Task You take on the role of a data analyst at MedBank, a cooperative bank specializing in the healthcare market. You have just received your new project and the associated dataset, which is unknown to you. Your task is to build the best possible prediction model for a marketing campaign for a new product, ""MedTrust"", and to predict the purchase probabilities. To this end, analyse the training dataset. First, carry out an exploratory analysis: for which of the explanatory variables can relationships with the target variable be observed? Which of them make sense from a domain perspective? Select a set of explanatory variables and then build various prediction models (logistic regression, decision trees, and random forests). For each method, build several models with different parameters. Proceed as systematically as possible, so that you can later describe the model's performance (= AUC) as a function of the parameters. Apply this model to make a prediction on the test data.`'",tabular data,Challenge DO,inClass,Challenge DO,auc,challenge-do 825,"'`The goal is to use unsupervised learning to create a model able to identify whether a given input belongs to the data distribution or not, i.e. whether it is an anomaly. In this challenge we will use chest X-ray images ""taken"" from the front (PA) and from the back (AP), and the model must indicate whether a given image is one of these X-ray types. 
For training we will use the following dataset: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge As a test set for determining the model's metrics, we created a Kaggle challenge containing normal and anomalous images (images that are not of type PA or AP, including lateral X-rays). Reference: https://docs.seldon.io/projects/alibi-detect/en/stable/overview/algorithms.html`'",image data,Chest XRay Anomaly Detection,,,f_{beta},chest-xray-anomaly-detection 826,"'`This competition is for students following the CO 544 course. Remember to set your team name, in the Team tab above, to the same one you registered in FEels. Acknowledgements We acknowledge the UCI repository for the data`'",tabular data,CO 544 Project,inClass,"A project for CO 544 2020, UOP, Sri Lanka",categorizationaccuracy,co-544-project 827,"'`The Earth is HUGE! Over 196.9 million square miles. While most of that is ocean, the remaining land is constantly changing as new roads are constructed and old roads are destroyed or abandoned. Automating the detection and labeling of roads is extremely useful for regular citizens looking at Google, for governments planning and maintaining their land, and for militaries ensuring national defense. What YOU do For this term project, students will create a model that can segment satellite images to identify roads. They will then use their model to generate predictions on an unlabeled (without the masks) set of images and submit their predictions to Kaggle. Students will compete both for the highest score and for the most unique and novel approaches, but above all they will learn how to work on large data science projects. Below is a satellite image with its road mask overlaid on it; the roads have been whited out by the mask. (Segmented image; image source: DeepGlobe.org) Why you should do it short run Kaggle competition scores and ranking will influence grades. 
However, the process, term paper, project presentation, blog posts, displayed continuous effort and improvement, innovation, sportsmanship (helping other students with published kernels or realizations), and finally demonstrated understanding and learning are much more important! Why you should do it long run The ability to work on large, complex problems is an extremely valuable skill. The ability to implement methods and use human intuition and creativity to develop novel solutions to new problems is extremely valuable. Great rewards await people who solve problems that do not have published solutions. This is a chance to practice and test your abilities in a low-risk situation. For example: a team that attempts to implement their own creative ideas while researching and learning about segmentation problems will likely receive a higher grade than a team that only implements an already published method that performs very well. Acknowledgements We thank DeepGlobe (http://deepglobe.org) for making this dataset available for academic use.`'",image data,COMP 540 Spring 2019,inClass,Detect roads in satellite images,dice,comp-540-spring-2019 828,"'`The evaluation page describes how submissions will be scored and how students should format their submissions. You don't need a sub-title at the top (the page title appears above). For an example of a high-quality evaluation page, see the 2013 KDD Cup page, which we have copied here: The evaluation metric for this competition is Mean F1-Score. The F1 score, commonly used in information retrieval, measures accuracy using the statistics precision p and recall r. Precision is the ratio of true positives tp to all predicted positives tp+fp. Recall is the ratio of true positives to all actual positives tp+fn. 
The F1 score is given by: \[ F1 = 2\frac{p \cdot r}{p+r}\ \ \mathrm{where}\ \ p = \frac{tp}{tp+fp},\ \ r = \frac{tp}{tp+fn} \] The F1 metric weights recall and precision equally, and a good retrieval algorithm will maximize both precision and recall simultaneously. Thus, moderately good performance on both will be favored over extremely good performance on one and poor performance on the other. Submission Format For every author in the dataset, submission files should contain two columns: AuthorId and DuplicateAuthorIds. DuplicateAuthorIds should be a space-delimited list. Every AuthorId counts as his/her own duplicate, and every duplicate should be listed under each of its respective ids. For example, if you suspect authors A, B, and C are the same, you should list A,A B C; B,B A C; C,C A B. The file should contain a header and have the following format: AuthorId,DuplicateAuthorIds 1,1 8,8 9,9 10 10,10 9 etc.`'",tabular data,Competição curso de verão,,,categorizationaccuracy,competio-curso-de-vero 829,"'` . 15000 457 , 184 . 30 . ! ( ). , (). My Team. , , . . , . , . Acknowledgements .`'",tabular data,"Competition 1, SHAD, Fall 2018",inClass,Predicting bank customers' spending,rmsle,"competition-1,-shad,-fall-2018" 830,"'`This challenge serves as the final project for the ""How to win a data science competition"" Coursera course. In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.`'",time series,Final project: predict future sales,,"Final project for ""How to win a data science competition"" Coursera course",rmse,final-project:-predict-future-sales 831,"'`This is the main page for the project of the first module of the Virtus Up AI training program. 
The challenge will be to develop agents for the game Connect X. This is not a conventional Kaggle competition; here we want students to create an agent and submit its performance against other agents. There are official competitions where agent behaviors can be submitted via a .py file, but that type of competition is not available to everyone. Those competitions are called Simulations. This competition was based on https://www.kaggle.com/c/connectx`'",tabular data,Connect X - Virtus Up,,,mae,connect-x-virtus-up 832,"'`BLACKBOX The baseline value for the competition has been set to an 80% accuracy score. Do remember that the predicted values will be checked and cannot be all 1 or all 0. Build a suitable model. All the best to everyone. Task The goal is to identify people's opinions of mobile phones using a feed-forward neural network. The data points are scraped from 91mobiles.com. Evaluation Metric Submissions are evaluated on the **Accuracy Score** between the predicted and the actual labels on the test dataset Acknowledgements Mobile Data Source: https://www.91mobiles.com/ sk`'",tabular data,Consumer Like or Dislike,,,categorizationaccuracy,consumer-like-or-dislike 833,"'`General Information COVID-19 is a hot topic worldwide. COVID-19 (previously known as 2019 Novel Coronavirus, or 2019-nCoV) is a new respiratory virus first identified in Wuhan, Hubei Province, China. This competition offers you the task of predicting COVID-19 spread and determining the factors that affect the spread. This task is based on various regions around the world. You need to estimate the spread of the virus in these areas. Main Tasks - According to the given dataset, try to predict the number of confirmed and death cases for the next ten days in different areas of the world. - Evaluate which factors affect the spread of Covid-19. - Provide information on which countries (according to health history) will be affected by Covid-19 more than others. 
- Which populations are at greater risk of contracting Covid-19 than others? Dataset Sponsors Global AI Hub (https://globalaihub.com/) AI Business School (https://aibusinessschool.com/)`'",tabular data,COVID-19 Predictive Analysis,,,rmsle,covid-19-predictive-analysis 834,"'`For COVID-19 patients a common stage in diagnosis is computed tomography (CT). A radiologist is often asked to estimate the extent of damage with respect to lung volume. It is a time-consuming procedure, because the radiologist must look through all axial slices of the CT scan and segment each of them. There are some COVID-specific findings: https://radiologyassistant.nl/chest/covid-19-corads-classification. In this challenge you are asked to segment ""ground-glass"" and ""consolidation"". The goal of this challenge is to help companies who are building CT-specific software come up with better solutions.`'",image data,COVID-19 CT Images Segmentation,,,f_{beta},covid-19-ct-images-segmentation 835,"'`This competition is for Kent Ridge AI scholars to practice binary classification on a very challenging COVID-19 CT scan dataset. It is said that good classification of such datasets can help with early diagnosis and can play a crucial role in stopping this pandemic. With the high number of false negatives from swabs, CT scans might provide a more reliable (albeit more expensive) way of diagnosis. Indeed, in a study of more than 1,000 patients published in the journal Radiology, chest CT outperformed lab testing in the diagnosis of 2019 novel coronavirus disease (COVID-19). The researchers concluded that CT should be used as the primary screening tool for COVID-19. Read more here. Grand Challenge Can YOU come up with a binary classification scheme to determine, just from the CT scan alone, whether someone has COVID-19? This classification scheme has to detect true positives well, but also must minimise false negatives! Facts about COVID-19 Below are some facts about COVID-19, extracted from the WHO website. 
COVID-19 is the infectious disease caused by the most recently discovered coronavirus. This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019. COVID-19 is now a pandemic affecting many countries globally. People can catch COVID-19 from others who have the virus. The disease spreads primarily from person to person through small droplets from the nose or mouth, which are expelled when a person with COVID-19 coughs, sneezes, or speaks. These droplets are relatively heavy, do not travel far and quickly sink to the ground. People can catch COVID-19 if they breathe in these droplets from a person infected with the virus. This is why it is important to stay at least 1 meter away from others. These droplets can land on objects and surfaces around the person such as tables, doorknobs and handrails. People can become infected by touching these objects or surfaces, then touching their eyes, nose or mouth. This is why it is important to wash your hands regularly with soap and water or clean them with alcohol-based hand rub.`'",image data,CT Scan COVID Prediction,,,meanfscore,ct-scan-covid-prediction 836,"'`Evaluation Criteria To achieve a passing grade, the Accuracy of the model has to be at least 70 percent. Evaluation Metric Submissions are evaluated on the Accuracy Score of the model. Submission File Format The file should contain a header and have the following format: Loan ID,Loan Status a3fdf1db-e991-4293-976c-7d35564c0aec,Charged Off c3f8006d-d1ef-4a94-ba55-c48034974205,Fully Paid 076b722f-3658-47a8-a7f1-5179a9b45ade,Fully Paid etc. You can download an example submission file (sample_submission.csv) on the Data page.`'",tabular data,Credit Risk Modeling,inClass,Credit risk modeling,categorizationaccuracy,credit-risk-modeling 837,"'`This is the home page of the competition. The first phase will be to use multiple regression to predict diamond price. This phase will use the lm() function. 
Phase 2 The next phase of the competition will be to redo the approach from phase 1 in Python. You can use the same algorithm or tweak it, based on what you learned. Phase 3 This phase will allow the use of functions other than lm(). This will give you a chance to explore boosting, neural networks, etc. Phase 4 Phase 4 will allow the use of stacking and ensembling techniques to improve your model. Dataset Overview Here is a decent analysis of the diamonds dataset: https://www.kaggle.com/stansilas/shine-bright-like-a-diamond/notebook We thank the developers of the ggplot2 library for providing this dataset.`'",tabular data,Tracy Regression,inClass,Competition to explore multiple regression for Tracy High School,rmsle,tracy-regression 838,"'`This page describes the ""Booking.com"" dataset. You should be able to answer the following questions: Which file should I choose? What is the format of the features? Which variable is to be predicted? Files descriptions Note: the separator is the comma! 
train_booking.csv - Training dataset test_booking.csv - Test dataset submission_booking.csv - An example of the file to submit on Kaggle Data fields Reading format: column - column description - column format train_booking.csv id - Listing ID - string name - Listing name - string description - Listing description - string neighborhood_overview - Description of the neighborhood - string notes - Notes about the accommodation - string transit - Transport options near the accommodation - string access - Access to the accommodation - string interaction - Interaction with the host - string house_rules - Rules of the accommodation - string host_id - Host ID - integer host_since - Host registration date - Date host_seniority - Host seniority in days (since registration) - integer hostresponsetime - Host response time to a question - String hostresponserate - Host response rate to questions - String hostlistingscount - Number of listings the host has on ""Booking.com"" - integer host_verifications - Host verifications - String neighbourhood_cleansed - Nearest neighborhood - String city - City of the listing - String zipcode - Postal code of the listing - Integer country - Country of the listing - String latitude_booking - Latitude of the listing - Decimal longitude_booking - Longitude of the listing - Decimal property_type - Accommodation type - String room_type - Room type - String accommodates - Number of rooms - Integer bathrooms - Number of bathrooms - Integer bedrooms - Number of bedrooms - Integer beds - Number of beds - Integer bed_type - Bed type - String amenities - Amenities - String price - Price of one night in the accommodation - Integer security_deposit - Security deposit - String cleaning_fee - Cleaning fee - String guests_included - Number of additional guests allowed - Integer extra_people - Surcharge if additional guests are present - Integer 
minimum_nights - Minimum number of nights allowed - Integer maximum_nights - Maximum number of nights allowed - Integer calendar_updated - When the listing's calendar was last updated - String availability_30 - Number of nights available over the next 30 days - Integer availability_60 - Number of nights available over the next 60 days - Integer availability_90 - Number of nights available over the next 90 days - Integer availability_365 - Number of nights available over the next 365 days - Integer first_review - Date of the first review - Date last_review - Date of the last review - Date reviewscoresrating - Number of reviews giving the maximum score, out of all reviews giving a score - Integer reviewscoresaccuracy - Average score for the accuracy of the listing description - Integer reviewscorescleanliness - Average score for the cleanliness of the accommodation - Integer reviewscorescheckin - Average score for check-in - Integer reviewscorescommunication - Average score for communication - Integer reviewscoreslocation - Average score for location - Integer reviewscoresvalue - Average overall score for the accommodation - Integer cancellation_policy - Cancellation type for the booking - String reviewspermonth - Number of reviews per month - Decimal geolocation - Geolocation of the accommodation (latitude, longitude) - String geopoint_announce - Geolocation of the accommodation - GeoPoint department - Department of the accommodation - String submission_booking.csv Id - Listing ID - string Predicted - Estimated price of one night in the accommodation - Double External Data RATP stations, in the file ""externaldataratp.csv"" latituderatp - Latitude - Double longituderatp - Longitude - String idrefzdl - Station ID - Integer garesid - Station (gare) ID - Integer nomlong - Station name v1 - String nummod - Modified number - Integer fer - Rail flag - Integer train - Train flag - Integer rer - RER flag - Integer metro - 
Metro flag - Integer tramway - Tramway flag - Integer navette - Shuttle flag - Integer val - VAL flag - Integer rescom - Station name v2 - String geopoint - GeoPoint of the station - GeoPoint`'",tabular data,Data Science - Master,inClass,Data science modeling of Booking.com listings,rmse,data-science-master 839,"'`Welcome to the Data Science Guild's first Datathon Competition Description MNIST (""Modified National Institute of Standards and Technology"") is the hello-world dataset of computer vision. It was released in 1999, and since then it has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike. In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. This dataset is not the real MNIST data but is quite similar. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare.`'",image data,Digit recognition,inClass,The Data Science Guild presents the first datathon - A twist to MNIST ,categorizationaccuracy,digit-recognition 840,"'` , . : 0/1. >2 Nave, 5 -- Advanced MAP@1. >2 , leaderboard MAP@1: , CS center kernels , .`'",tabular data,CSC: HW4 spring19,inClass,"Trees again, ensembles, ...",auc,csc:-hw4-spring19 841,"'`You have just received a hologram message from a droid: Future Data Science Jedi Knight! Years ago, you served with my father in the Clone Wars. Now he begs you to help him in his struggle against the Empire. The Rebellion is under siege from a new division of the Empire propaganda video games! We need you to predict the sales of a collection of their video games. We'll use this information to bring peace to the Galaxy. This is our most desperate hour. Help me, Future Data Science Jedi Knight. You're my only hope. 
[looks to the side quickly, then crouches to end the message] Best of luck to everyone and don't forget your Jedi Knight training: Be Scrappy Radiate Positivity Pursue Mastery Work Together Make No Little Plans`'",tabular data,Flatiron School,inClass,Use your regression skills to save the Galaxy,mse,flatiron-school 842,"'`MBTI personality profile prediction In this challenge, you will be required to build and train a model (or many models) capable of predicting a person's MBTI label using only what they post in online forums. This challenge will require the use of Natural Language Processing to convert the data into a machine-learning-ready format. This data will then be used to train a classifier capable of assigning MBTI labels to a person's online forum posts. Read more about the MBTI personality types here OR, better yet, take the test for yourself! Each MBTI personality type consists of four binary variables; they are: Mind: Introverted (I) or Extraverted (E) Energy: Sensing (S) or Intuitive (N) Nature: Feeling (F) or Thinking (T) Tactics: Perceiving (P) or Judging (J) Each person will have only one of the two categories for each variable above. Combining the four variables gives the final personality type. For example, a person who is Extraverted, Intuitive, Thinking and Judging will get the ENTJ personality type. You will need to build and train a model that is capable of predicting labels for each of the four MBTI variables - i.e. 
predict four separate labels for each person which, when combined, result in that person's personality type.`'",text data,Personality Profile Prediction,inClass,Classify a person's MBTI personality type using text from what they post online.,meancolumnwiselogloss,personality-profile-prediction 843,"'`Fashion-MNIST is Zalando's article-image dataset: 42,000 training images and 28,000 test images, each 28x28 pixels. Data Overview: https://www.kaggle.com/kenichinakatani/fashon-mnist-with-cnn/ Acknowledgements We thank Zalando for providing this dataset (Fashion MNIST).`'",image data,Fashion MNIST challenge2019,inClass,for IWASAKI ML CLASS ONLY,categorizationaccuracy,fashion-mnist-challenge2019 844,"'`This is the second competition to test your understanding of the machine learning concepts learnt. For this competition, you will be applying your knowledge of algorithms and modelling to predict a customer's purchase behaviour. Use the customer data provided to create a classification model that predicts a customer's likelihood to buy a bike. Ensure you apply what you've learned in the resources shared.`'",tabular data,Technidus machine learning competition 2,inClass,Making predictions on customer purchase behaviour,categorizationaccuracy,technidus-machine-learning-competition-2 845,'`Just classify the digits shown in the npz file and save the results as a CSV file. Accuracy is all that matters!`',image data,Digit Classification DL Workshop,inClass,Classifying funky digits,categorizationaccuracy,digit-classification-dl-workshop 846,"'`The objective of this competition is to allow the participants to experiment with feature selection, feature engineering and regression models. The target value is the price of house sales. Multiple features are provided; you should select the combination of features (and maybe add more features) that minimizes the RMSE score. 
$$RMSE = \sqrt{\sum_{i=1}^{n}\frac{(x_{i}^{p}-x_{i}^{t})^{2}}{n}}$$`'",tabular data,House Sales,,,rmse,house-sales 847,"'`This is a practice competition whose objective is to teach the fundamentals of how to participate in a competition and submit an analysis, and to apply the NLP technique Bag of Words (BOW). The problem to be solved is fake news detection, an activity that plagues the cyber world, generating disinformation and even occasional deaths. With machine learning and artificial intelligence we can apply some techniques to try to mitigate this problem and alert people to the potential danger of believing a fake news story, and also of sharing it.`'",text data,Fake News e ML,inClass,Detecting fake news with machine learning,categorizationaccuracy,fake-news-e-ml 848,'`Predict Facial Landmarks and win!`',image data,GL Hack: Landmarks,inClass,Predict face landmarks location,mae,gl-hack:-landmarks 849,"'`This is the final assignment of the Deep Learning QSTP - 2019. The task involves image classification on various types of indoor-scenery images. There are 67 image types (labels or classes) in total. You have 10,934 images to train your model. You need to submit predictions (the class type, an integer in the range [0,66]) for the 4,686 test images given in the sample_sub.csv file. The public leaderboard will evaluate the accuracy of your predictions on 2,343 images, i.e. 50% of the test set. The final rankings will be given by the private leaderboard, which will be based on accuracy on the remaining 50% of the data. The private leaderboard will be displayed after the competition is over. Use of transfer learning is encouraged, and you can use any pre-trained model available in Pytorch here. IMPORTANT: All announcements from mentors, and your doubts and discussions, should be put up in the discussions section. 
You can use the TEMPLATE NOTEBOOK given in the Kernels section to start with (you need to fork this notebook), or start with your own fresh code`'",image data,QSTP - Deep Learning 2019,inClass,The final assignment for QSTP Deep Learning 2019,categorizationaccuracy,qstp-deep-learning-2019 850,"'`Welcome Data Champions You've seen a small piece of the Android Permissions dataset, but now you're getting all of it. This challenge tests what you've learnt in a competitive environment with a public and private leaderboard. The aim is to predict whether a given Android application is malware or not. You'll be given 14889 labeled observations which you can use for training and validation, and 6382 unlabeled observations which will be used for testing. Your position on the public leaderboard will depend on your predictions on 70% of the test data, and the remaining 30% of the test dataset will be used to determine the private leaderboard, which will only be revealed after the competition has ended. Where do I start? The first thing to do is read through the Data documentation and then download the data. Once you have saved and extracted it, you can read it into Python using Pandas and split the training data into train and validation datasets. Or you can leave the training data as is and use cross-validation. Or you can do both; you can experiment and decide. Once you have the data loaded, you can start exploring and building models! Just follow the Data Champions lessons Acknowledgements The data are licensed under a Creative Commons Attribution 4.0 International licence. Changes were made to the data to prevent using the original data to cheat in this competition. 
We thank the data author, Arvind Mahindru, and provide a DOI to the original dataset: 10.17632/958wvr38gy.5`'",tabular data,Data Champions Android App Malware Prediction,inClass,"Using permissions for Android applications, predict whether or not the application is Malware or Benign",auc,data-champions-android-app-malware-prediction 851,"'`Welcome to the Kaggle challenge for Project 2! As part of a successful submission for Project 2, we will expect you to make at least one (and, hopefully, multiple!) submissions towards this regression challenge. In this challenge, you will use the well-known Ames housing data to create a regression model that predicts the price of houses in Ames, IA. You should feel free to use any and all features that are present in this dataset. Goal It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. Evaluation Kaggle leaderboard standings will be determined by root mean squared error (RMSE). $$\text{RMSE}=\sqrt{\sum\frac{(\hat{y}_i - y_i)^2}{n}}$$ Submission File Format The file should contain a header and have the following format: Id,SalePrice`'",tabular data,DSI-US-8 Project 2 Regression Challenge,inClass,Predict the price of homes at sale for the Ames Iowa Housing dataset,rmse,dsi-us-8-project-2-regression-challenge 852,"'`Welcome to the first DSNet Kaggle Competition + Kaggle Workshop Before we share the competition details, a few ground rules from the community: This competition will be a beginner-friendly competition aimed at helping you get started with Kaggle. Special recognition and awards will be granted to those who create useful and healthy discussions and kernels. If you're joining from a community outside of DSNet, click here to join the DSNet Slack. The only pre-requisites for this competition are fast.ai Lectures 1-4. 
If you want to do a quick recap, here are the L1+L2 review from our study group and the L3+L4 review. No teams are allowed for this competition; the reason is to help you build the confidence to get started individually. No private discussion outside of Kaggle is allowed; for all discussions, we will be using the Kaggle forums. We request you to avoid using the DSNet Slack for discussions as well; the idea is to get familiar with the Kaggle setup. We will be sharing (dirty + misleading) starter kernels; please feel free to use these cautiously, and note that you will have to re-engineer these approaches in order to gain a winning solution. Schedule and How to Join Timings for the Live Workshop: 4-6 PM IST, July 6 Join URL (Zoom Call): Link to Join Note this will also be recorded and released on our YouTube Channel Competition Timeline: 4PM July 6 to 8AM July 8, 2019 (All timings in IST) Organising Team: Aakash NS Kartik Godawat Prajwal Prashanth Sanyam Bhutani Siddhant Ujjain`'",image data,DSNet: fastai Hackathon,inClass,Competition for Data Science Network Getting Started Workshop,categorizationaccuracy,dsnet:-fastai-hackathon 853,"'`Fashion-MNIST is Zalando's article-image dataset: 42,000 training images and 28,000 test images, each 28x28 pixels. Data Overview Kernels: https://www.kaggle.com/kenichinakatani/makesubmission Acknowledgements We thank Zalando for providing this dataset (Fashion MNIST).`'",image data,Fashion MNIST challenge201907,inClass,for IWASAKI ML CLASS ONLY,categorizationaccuracy,fashion-mnist-challenge201907 854,"'`Mini Machine Learning Competition This is a private ML competition for students of the Data Mining and Probabilistic Reasoning class at the University of Tübingen, WS19/20. Data This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. 
Goal The goal is to create an ML model that predicts whether a client will default on their credit card payment next month.`'",tabular data,[DM&PR WS19/20] Machine learning competition,,"A private ML competition for DM&PR students, Uni Tübingen",auc,[dm&pr-ws19/20]-machine-learning-competition 855,'``',tabular data,InClass Competition at Tokyo Metropolitan Univ.,inClass,Predicting the price for an Airbnb host.,rmsle,inclass-competition-at-tokyo-metropolitan-univ. 856,"'`Bored of MNIST? Try Kannada digits! Acknowledgements Kaggle thanks Vinay Prabhu for providing this interesting dataset for a Playground competition. Image reference: https://www.researchgate.net/figure/speech-for-Kannada-numbers_fig2_313113588`'",image data,Tobigs13_7week_competition,inClass,Let me do it again~,categorizationaccuracy,tobigs13_7week_competition 857,"'`AI Academy Submission: use a Kernel (Python or R) and submit the csv from your Kernel notebook. Kernels must be made Public. Rules Evaluation: RMSLE`'",tabular data,Exam for Students20200129,inClass,This is an exam for students in the Academy. Have fun with ML!,rmsle,exam-for-students20200129 858,"'`The data for this competition was provided by one of the largest ticketing companies in Indonesia. In this competition, participants are asked to predict whether cross-selling will occur on a customer's flight transaction. Cross-selling is defined as: the customer makes a hotel booking together with purchasing a flight ticket. Good luck and enjoy the competition!`'",tabular data,Penyisihan Datavidia 2019,inClass,"Qualifying round for Datavidia, hosted by Arkavidia 6.0",meanfscore,penyisihan-datavidia-2019 859,"'`Steam: https://www.kaggle.com/tamber/steam-video-games`'",tabular data,DS2019 2,,,map@{k},ds2019-2 860,"'`Low Margins, High Importance Background: 80% of producing oil wells in the United States are classified as stripper wells. 
Stripper wells produce low volumes at the well level, but at an aggregate level these wells are responsible for a significant percentage of domestic oil production. Stripper wells are attractive to a company due to their low operational costs and low capital intensity - ultimately providing a source of steady cash flow to fund operations that require more funds to get off the ground. At ConocoPhillips, our West Texas Conventional operations serve as a source of organic cash flow to fund more expensive projects in the Delaware Basin and other unconventional plays across the United States. As a company, it is vital that this steady, low cost form of cash has a constant presence. As with all mechanical equipment, things break and when things break money is lost in the form of repairs and lost oil production. When costs go up cash goes down, but how can we predict when equipment will fail and use this information to drive down our costs? The Challenge: A data set has been provided that has documented failure events that occurred on surface equipment and down-hole equipment. For each failure event, data has been collected from over 107 sensors that collect a variety of physical information both on the surface and below the ground. Using this data, can we predict failures that occur both on the surface and below the ground? Using this information, how can we minimize costs associated with failures? The goal of this challenge will be to predict surface and down-hole failures using the data set provided. This information can be used to send crews out to a well location to fix equipment on the surface or send a workover rig to the well to pull down-hole equipment and address the failure. In addition to uploading a solution file (described in ""Evaluation""), teams will be asked to provide a ""kernel"" via a markdown file. The kernel provides us with your code and output in addition to answers for the prompts in the ""Kernels Requirements"" section. 
These prompts in the ""Kernels Requirements"" section will determine your overall placement in the competition.`'",tabular data,Predictive Equipment Failures,inClass,Predict downhole equipment failures using sensor data!,meanfscore,predictive-equipment-failures 861,"'`Description. Your task for this exercise is to develop machine learning models of your own choice to predict the likelihood of blood-brain barrier penetration for a chemical compound, based on the given chemical structural information. Evaluation. The Area Under the Receiver Operator Characteristic (AUC) will be used for evaluation (see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html and https://en.wikipedia.org/wiki/Receiver_operating_characteristic for details about this metric). Submission. The submitted csv file should consist of two columns with the header of: ""TestId"" and ""PredictedScore"" (i.e. probability output or score for a chemical being BBB permeable). Leaderboard and final evaluation. The predictions on about 50% of the test data points are used to score the submission according to the AUC and maintain a public leaderboard. The predictions on the remaining test data points will be used, after the submission deadline, for the final evaluation, and you receive marks according to the resulting area under the ROC curve (AUC) for the remaining test set. This prevents a high score in the final evaluation from being obtained through overfitting the public test data, but it means that the public leaderboard will not necessarily be indicative of final performance. Acknowledgements The original dataset can be found at https://pubs.acs.org/doi/suppl/10.1021/ci300124c The publication associated with this data is: Martins IF, Teixeira AL, Pinheiro L, Falcao AO. (2012). A Bayesian approach to in silico blood-brain barrier penetration modeling. Journal of Chemical Information and Modeling, 52(6), 1686-97.
We thank the authors at the University of Lisbon for providing the original dataset.`'",tabular data,CSM/SEM6420 workshop,inClass,Compound classification using chemical structural information,auc,csm/sem6420-workshop 862,"'`Welcome to the Kaggle Challenge INFO 254/DATA144! In this lab, you will get a chance to use and apply all the concepts you have learnt so far. This is also a very good opportunity to get your feet wet in data science competitions! You will also have a chance to experience first-hand the problems data scientists can face while dealing with a dataset. The dataset we have chosen is the Yelp review dataset, similar to those you saw in previous labs, but with more columns. Your task is to predict the rating category (is_good_rating) based on other features. Feel free to play around with the data as much as you want. Explore it. Create new features. Drop unnecessary features. Slice and dice, aggregate, drill up and drill down. For this competition, you need to submit your code via Kaggle as well as bCourses. There should be only one submission per team (on both Kaggle and bCourses). The deliverables include: Write-up : A brief description of the approach you used to solve the problems. Submit this on bCourses. Be brief but include all that is important. What was the data like? How did you explore it? How did you transform it? Which models did you use and why? What were your results? Make sure to include your team members' names as well as your Kaggle team name. You can find a template linked on bCourses. Code : Include any notebooks or scripts you used in your solution. Submit as a zipped file on bCourses. Kaggle Submission : Submit your results to Kaggle for automatic scoring. Note that this competition is worth twice as much as the regular labs.
You have the opportunity to score 20 points instead of 10.`'",text data,DMA Kaggle Challenge,,Classify ratings in yelp review datasets,categorizationaccuracy,dma-kaggle-challenge 863,"'`Your task is to predict the type of toy from given features. Accuracy will be used to evaluate your predictions. When you submit your predictions only 50% of them will be evaluated and you can see your score on the Public Leaderboard. At the end of the contest the remaining 50% will be evaluated and final standings will be displayed on the Private Leaderboard. A simple average of the Public and Private Leaderboards will be considered as the final score. This competition is evaluative.`'",tabular data,Eval Lab 2 F464,inClass,The second evaluative lab of BITS F464 Fall 2019,categorizationaccuracy,eval-lab-2-f464 864,"'`This is a training task. The original dataset is taken from http://ai.stanford.edu/~amaas/data/sentiment/ Predict the number of positive and negative reviews using either classification or deep learning algorithms.`'",text data,EPAM: Exercise 1 - Sentiment Analysis,,,meanfscore,epam:-exercise-1-sentiment-analysis 865,"'` 6. . 10 , 2 . - (K-means, DBSCAN, etc) . : numpy pandas sklearn.model_selection sklearn.metrics sklearn.multiclass sklearn.preprocessing scipy : : - https://www.kaggle.com/rounakbanik/the-movies-dataset , . https://www.kaggle.com/c/finec-1941-hw6/data , private kernel ; (@gorodec) , `'",tabular data,finec-1941-hw6,inClass,Clustering,macrofscore,finec-1941-hw6 866,"'`Our first task is to provide a summarization for all the provided 50k time series of both the synthetic dataset and the seismic dataset. Informally, summarization means using less memory to represent the original dataset, which usually brings space efficiency. For example, the size of our original dataset is 4 bytes / float * 256 float numbers * 50000 series = 51.2 MB, with each series occupying 1024 bytes. Our summarization should use less memory.
In this project, we will need to summarize each series (of 1024 bytes) using 32 bytes, 64 bytes, or 128 bytes. We will evaluate the effectiveness of our summarizations for these 3 different cases. Notebook Submissions have to be in Python. Go to 'Notebooks', then 'Your Work', then 'Create New Notebook' (or duplicate the template notebook). The submitted notebook needs to have 3 functions for summarization and 3 functions for reconstruction with exactly the following names: sum32 sum64 sum128 rec32 rec64 rec128 Each summarization function takes as an input argument the path of the file containing 50000 series (a series contains 256 data points of 4 bytes each). Each summarization function should calculate the summarization of the 50000 series, write the results into a file, then return the path of the summarization file, named by concatenating _sum32, _sum64, or _sum128 respectively to its name. For instance, if the filename is syntheticsize50klen256znorm.bin, then the sum32 function should output the filename path as syntheticsize50klen256znorm.bin_sum32. Each reconstruction function takes as an input argument the path of the file containing 50000 summaries (of respectively 32 bytes, 64 bytes, or 128 bytes). Each reconstruction function should calculate the reconstruction of the 50000 series summaries, write the results into a file, then return the path of the reconstruction file, named by concatenating _rec32, _rec64, or _rec128 respectively to its name. A solution template can be found here: https://www.kaggle.com/abdumaa/submission-template-time-series Submission The submission file is a CSV file containing a header 'id, expected' followed by the reconstructed time series based on 32-byte, 64-byte, and 128-byte summaries as follows: Indexes of the time series (back to back), based on 32-byte summaries, span from id=0 to id=12799999. Indexes of the time series (back to back), based on 64-byte summaries, span from id=12800000 to id=25599999.
Indexes of the time series (back to back), based on 128-byte summaries, span from id=25600000 to id=38399999. Here is the official documentation on submitting a solution from a notebook: https://www.kaggle.com/dansbecker/submitting-from-a-kernel I would rather recommend this YouTube video on how to participate and submit a solution in a Kaggle competition from a notebook (kernel): https://www.youtube.com/watch?v=GJBOMWpLpTQ To follow up on the YouTube video: if you cannot find the button ""Commit"", you need to click on ""Save Version"", then choose ""Save & Run All (commit)"". Do not forget to share your submission notebook after you have submitted your solution: click on the button ""share"", type ""abdu maa"" into the ""collaborator"" search box, then click ""save"". Since the NAMES/TITLES of submissions and shared notebooks for both the summarization and similarity Kaggle competitions are NOT INFORMATIVE, please make sure to include your submission and notebook title in the online Google sheet https://docs.google.com/spreadsheets/d/1mtVwiG9aelYIeS44Aoz50396p2WCywCBE2aVTOvbut0/edit#gid=1976504894`'",tabular data,Data Series Summarization Project (v3),inClass,,rmse,data-series-summarization-project-(v3) 867,"'`Welcome to the 2020 CS Data Challenge Covid-20 context Unfortunately, we couldn't join all of you in the same room for the reason you already know. This year the challenge takes place on the Kaggle In-Class platform. It provides some really interesting features which should help you to work from home. It allows you to: work as a team, access data, code online (notebook or scripts), submit your predictions. We will of course guide you through your first steps on that platform. Challenge Organisation Data context We will dive into the wind turbines world in order to predict more accurately the wind power production of turbines in terms of kW. The data was measured with sensors inside the turbines.
The idea here is to determine a solution to predict production (in kW) based on values measured by sensors within the wind turbines.`'",tabular data,CS Challenge,,,mae,cs-challenge 868,"'`Twitter Sentiment Analysis For this task you need to classify tweets according to their sentiment (strongly negative (0), negative (1), neutral (2), positive (3), highly positive (4)). For each tweet you will need to predict its sentiment. Acknowledgements The task uses a subset of the data that was published in: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.`'",text data,CS98X Twitter Sentiment Classification,,,categorizationaccuracy,cs98x-twitter-sentiment-classification 869,"'`Human Gene Function Prediction Challenge Genome-wide screens are experiments where each gene in an organism is systematically perturbed to establish its relationship to a phenotype of interest. In mammalian cell lines, these experiments typically use pooled lentiviral CRISPR-Cas9 libraries to knock out every gene in the organism in a single screen. For example, if the goal is to identify genes which are potential therapeutic targets in a given cancer, a CRISPR genome-wide screen could be conducted on a cell line derived from that particular cancer to identify genes that are essential in that specific genetic background (read [1] and [2] for more information). Or, if the goal is to identify human gene-gene interactions - where a double-mutant organism displays unexpectedly strong or weak phenotypes with 2 specific mutations - a genome-wide screen could be conducted on a single-knockout cell line to construct double mutants. The genetic dependencies identified by such genome-wide screens have been shown to be very powerful for understanding gene function.
Specifically, previous work in model organisms demonstrated that genes that exhibit highly similar dependency profiles tend to be involved in the same protein complex, pathway or biological process (e.g. see [3] for our previous results generating and interpreting this type of data in yeast). With the latest developments in CRISPR-Cas9 technology, genome-wide genetic screens can now be efficiently completed in human cells. This project focuses on developing machine learning approaches that use gene dependency data from genome-wide CRISPR-Cas9 screens to predict human gene function. Provided Data Human genetic interaction profiles: ~17,000 x ~600 (genes x cell lines) We will provide you with genome-wide screening data from CRISPR-Cas9 screens across several hundred cell lines derived from the Dependency Map project https://depmap.org/portal/. Each row of this matrix corresponds to a single human gene, and each column of this matrix corresponds to a different human cancer cell line. Large negative values reflect a specific genetic dependency on that particular gene in the corresponding cell line (i.e. cases where a gene is specifically essential for growth in that cancer cell line). This data has not been normalized; please either do standardization (subtract the mean, divide by the standard deviation), quantile normalization, or any other technique that you think might help. GO term annotation matrix: ~17,000 x 200 (genes x GO annotations) To support a supervised machine learning approach, we will provide you with labels for 200 different GO terms for which we would like you to build a supervised machine learning model. This matrix is binary and includes a row for each of the genes that appears in the genetic interaction profile matrix described above: a 1 in a given position of this matrix means that the gene is annotated with the corresponding column's GO term and a 0 means that the gene is not annotated with the corresponding column's GO term.
A subset of gene annotations have been held back for us to evaluate the performance of your predictions and will be evaluated on the public and private leaderboards. Your Challenge Your goal is to use a supervised machine learning approach to predict GO biological process annotations for each of the 200 GO terms based on each gene's interaction profile. More specifically, given a single gene's interaction profile (i.e. a row), you should provide predictions for each of the 200 GO terms. You can train and evaluate your models on the genes that appear with labels in the training data (1s or 0s; the training set), and you will submit your model's predictions on the genes that are unlabeled in the test set (the validation set). We will independently evaluate all teams' predictions by mean column-wise ROC AUC as we discussed in class. Note that for genes in the validation set, we have recoded their names such that no information other than the genetic interaction matrix can be used to predict function. References [1] Wang et al. Identification and characterization of essential genes in the human genome. Science. 2015 Nov 27;350(6264):1096-101. doi: 10.1126/science.aac7041. [2] Meyers et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nature Genetics 2017 October 49:1779-1784. doi:10.1038/ng.3984. [3] Costanzo et al. A global genetic interaction network maps a wiring diagram of cellular function. Science. 2016 Sep 23;353(6306). pii: aaf1420. [4] DepMap project website: https://depmap.org/portal/`'",tabular data,CSCI 5461 Spring 2020,inClass,Human gene function prediction challenge,mcauc,csci-5461-spring-2020 870,"'`Given an image of a person's face, the task of classifying the ID of the face is known as face classification. The input to your system will be two face images and you will have to predict the similarity of these two images, i.e. whether they belong to the same person.
The ground truth will be present in the training data and the network will be doing an N-way classification to get the prediction. You are provided with a validation set for fine-tuning your model.`'",image data,CUHKSZ-FaceComp,,,auc,cuhksz-facecomp 871,"'`The dataset contains real information about traffic accidents. It includes data about the driver, the presence of alcohol or drugs, weather conditions, and road type. The objective is to predict whether or not the accident will result in fatalities.`'",tabular data,Curso en Ciencia de Datos de la UGR (6ª edición),,,auc,curso-en-ciencia-de-datos-de-la-ugr-(6-edicin) 872,"'`Students of the FFHS Machine Learning course drew crosses, circles, and plus symbols on a sheet of paper and then saved the individual symbols as image files. Your task is to recognize these symbols: for the images in the file Xtest, predict the label. Upload a file similar to sample_submission.csv. It has exactly 716 rows and looks roughly like this: id,target 715,0 716,2 717,1 ... 1429,2 The first number is the image index. The file test.tar.gz contains files with the naming structure 1234-u567.png, i.e. -u???.png. The integer after the comma is the target label for the image and is either zero (cross), one (circle), or two (plus): 0,x,(image shows a cross) 1,o,(image shows a circle) 2,+,(image shows a plus) `'",image data,Daan-Kreuz-Kreis-Plus,inClass,"Cross, circle, or plus?",categorizationaccuracy,daan-kreuz-kreis-plus 873,"'`The DAT158-2019 class competition Welcome to the first assignment of DAT158ML, 2019. You'll work in teams trying to create machine learning models for predicting house prices. The competition is based on the data set studied in Chapter 2 of the course text book: Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.
You'll learn everything you need to get started by reading through that chapter. May the best team win!`'",tabular data,DAT158-2019,,,rmse,dat158-2019 874,"'`Compulsory assignment 4 (CA4) for DAT200, spring 2020 Image credits: Photo by Wenniel Lun on Unsplash Photo by joah brown on Unsplash`'",,DAT200-CA4-2020,inClass,"Compulsory assignment 4 for DAT200, Spring 2020",categorizationaccuracy,dat200-ca4-2020 875,"'`You are given a dataset for classification. The train set contains 700 tuples and 66 attributes (including the Class) and the test set contains 300 tuples and 65 attributes. Build a model(s) that can label the test tuple into one of the four predefined classes [0,1,2,3]. You may use any of the algorithms that have been taught to you so far (including clustering). Usage of algorithms such as XGBOOST or Neural Networks is PROHIBITED.`'",,Data Mining Assignment 2,,,meanfscore,data-mining-assignment-2 876,"'`Welcome to our challenge! This is our first in-class competition on Kaggle, so let's do our best ;) Your submission file must have 2 columns: TransactionID and isFraud`'",,Data Science Circle Challenge,inClass,Can you get the highest score??,auc,data-science-circle-challenge 877,'`Let's try to predict the price of a product based on a collection of over one hundred thousand reviews and other product features.`',,DataC'EPT :Wine prices prediction,,,rmse,datacept-:wine-prices-prediction 878,"'`Welcome to Data Maestro 2020! Do you ever look up at your computer screen and wonder whether what you saw the other day was a planet? No? Well, here's your chance! Who needs telescopes to detect a planet when you have data to play with! A label has been provided indicating whether, given the features, an object is a planet / a star / neither. Your task is to design the model with the available data provided to you. May the best predictions win!`'",,Data Maestro 2020,inClass,Look! Up in the sky! It's a planet! It's a star!
It's... neither?,meanfscore,data-maestro-2020 879,"'`Solar Radiation Prediction The dataset contains columns such as wind speed and direction, humidity, and temperature. The competition consists of predicting solar radiation based on these data. Four months of data are available. The error metric is MSE, so the lower it is, the better the result. The units of each column are: Radiation: solar radiation in watts per meter^2 Temperature: temperature in degrees Fahrenheit Humidity: humidity in % Pressure: atmospheric pressure in mm of Hg WindDirection(degrees): wind direction in degrees Speed: wind speed in miles per hour Sunrise/sunset: sunrise/sunset (Hawaii time) UNIXTime: timestamp (seconds since 01-01-1970) Data: date Time: time`'",,Radiación Solar,,,mse,radiacin-solar 880,"'`Reddit is an entertainment, social networking, and news website where registered community members can submit content, such as text posts or direct links, making it essentially an online bulletin board system. Registered users can then vote submissions up or down to organize the posts and determine their position on the site's pages. Content entries are organized by areas of interest called ""subreddits"". The subreddit topics include news, gaming, movies, music, books, fitness, food, and photosharing, among many others. When items (links or text posts) are submitted to a subreddit, users (redditors) can vote for or against them (upvote/downvote). Each subreddit has a front page that shows newer submissions that have been rated highly. Redditors can also post comments about the submission, and respond back and forth in a conversation-tree of comments; the comments themselves can also be upvoted and downvoted. The front page of the site itself shows a combination of the highest-rated posts out of all the subreddits a user is subscribed to. The Reddit website has an API and its code is open source.
In July 2015, a Reddit user identified as Stuck_In_the_Matrix made public a dataset of Reddit comments for research. The dataset has approximately 1.7 billion comments and takes 250 GB compressed. Each entry contains comment, score, author, subreddit, position in comment tree and other fields that are available through Reddit's API. One of the user attributes that is not natively supported by the Reddit platform is gender. However, in some subreddits, users can self-report their genders as part of the subreddit rules. In the scope of this competition, users that self-reported their gender are selected from the dataset, and your goal is to predict the gender of these users.`'",,Data Mining: Stat. Modeling & Learning from Data,,Determine the gender of Reddit authors using their comments,auc,data-mining:-stat.-modeling-&-learning-from-data 881,"'`Competition of the DataScience Belém course. Prof. Adriano Avelar.`'",,Competição DataScience Belém,,,auc,competio-datascience-belm 882,"'`Tic-tac-toe (also known as noughts and crosses or Xs and Os) is a paper-and-pencil game for two players, X and O, who take turns marking the spaces in a 3x3 grid. The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row wins the game. Given are some recorded games, won or lost, with the moves the player opted for at each position in the grid, the total positions being 9. The value at each column can be X, O, or b for blank. Predict the outcome of the game.`'",,Datathon19,inClass,Predict Outcome of the XO game,categorizationaccuracy,datathon19 883,"'`Mbak Mawar is a big influencer on a social media platform. In her daily life she often receives emails containing proposals for collaborations, business, and so on, but lately she has been complaining because many of the emails she receives are spam.
Then one morning you and Mbak Mawar happen to meet at an angkringan (street food stall), and she tells you her troubles. As a kind-hearted person, you are expected to help her by predicting whether an email is spam or not. Don't forget to upload your paper on the Joints website by May 3, 2020 at the latest.`'",,Data Mining Joints 2020,,,categorizationaccuracy,data-mining-joints-2020 884,"'`Deep Learning Classwork Welcome to the Deep Learning Classwork! During this class you'll have the opportunity to test what you learned during the course. We set up a competition to make things more fun! You will have 4 hours to solve a classification problem on the Sign Language Digits Dataset, which contains images of hand gestures representing numbers from 0 to 9. Being a classification problem, given an image, the goal is to predict the correct class label. Please find the dataset details at this link and a description of the evaluation procedure on the ""Evaluation"" tab. We have created a template notebook that you can use to implement your code. You can find it in the ""Kernels"" tab. Recommendations You can choose any framework we have seen during the course to solve the challenge (TensorFlow, PyTorch, Keras, SciKit-Learn) and you don't have any constraint on the algorithm to be used. We suggest you organize your code following the approach shown during the lab sessions (e.g., data loader, training loop, etc). Feel free to use your notes, but please do not copy any implementation from the internet (we are gonna watch your terminal!).`'",,Deep Learning Classwork,inClass,Sign Language Digits Classification,categorizationaccuracy,deep-learning-classwork 885,"'`HW3-2-classification Put your article in, then get your category out!
category(0~9): {'Japan_Travel': 0, 'KR_ENTERTAIN': 1, 'Makeup': 2, 'Tech_Job': 3, 'WomenTalk': 4, 'babymother': 5, 'e-shopping': 6, 'graduate': 7, 'joke': 8, 'movie': 9}`'",,NTUST-NLP-HW3,,,categorizationaccuracy,ntust-nlp-hw3 886,"'`Classification holds the spot of most popular task in the field of data science; at least 70% of the tasks in real-world problems are classification problems. This task is of the same category: the participant has to classify seeds into different categories based on a set of captured characteristics of the seeds. Participants should consider the amount of data available, as the number of instances is considerably smaller than in a usual data science problem, and so choose a statistical model which accommodates such constraints.`'",,Wingify - Devfest - Q2,inClass,Let's learn a little about seeds but are we the ones learning !!!,categorizationaccuracy,wingify-devfest-q2 887,"'`Welcome to the InClass Competition! You have bank marketing records of people, with deposit as the predicted outcome (binary classification: no - hasn't placed a deposit (0), yes - has placed a deposit (1))`'",,Module 5. Supervised Learning,inClass,Build a classification model to predict if user placed a deposit.,auc,module-5.-supervised-learning 888,'``',,DM Assignment 3,,,auc,dm-assignment-3 889,"'`The goal is to build a predictive model of the time at which PM2.5 particle concentration measurements were taken, based on the predictive information. The main motivation for this kind of model, based only on meteorological observations, is the ease of obtaining these data in urban areas, which is less costly than a reliable air-monitoring system.`'",,Clasificación,,,logloss,clasificacion 890,"'`The goal is to build a predictive model of PM2.5 particle concentrations from wind (speed and direction) and precipitation values.
The main motivation for this kind of model, based only on meteorological observations, is the ease of obtaining these data in urban areas, which is less costly than a reliable air-monitoring system.`'",,Regresión,,,rmse,regresion 891,"'`Background Imperva's main mission is to protect data and websites. Given a web request, Imperva's goal is to decide whether it is malicious or not. Analysing the request history of the user makes this decision easier and more accurate, but a user's request history is usually not available, as requests on the web do not carry a unique identifier of the user. Task In this task you will determine if two different requests originate from the same user (e.g., web browser). One of the methods to do so is to extract multiple features related to the user's characteristics on every outgoing request, such as its web browser's characteristics and its hardware information. These characteristics can then be used to construct a model that efficiently identifies unique users. You will be given (1) A train set of single user examples - each sample (row) is a request made by a user; (2) A test set of pairs of two users' characteristics - X1 and X2 users; and (3) An optional auxiliary train set, which was created by converting the train set into the format of the test set. Efficiency In real-world implementations, there is a huge number of users in the system at any given moment. There is an even larger number of HTTP requests that should be processed by the system. Building a similarity function between pairs of users' characteristics could be a legitimate solution in some cases. However, if such a function is to be used, each new request will require O(n) calls to the function, if there are n existing users in the system. Our system requires real-time responses to provide the best protection possible. Therefore, efficiency is top priority, and thus solutions like similarity functions are not good enough.
Therefore, we expect a solution that can identify a user in O(1). An efficient algorithm can be achieved by constructing a unique string for each of the users. Storing the identifying strings in a hash table allows retrieving an existing identifying string in O(1) complexity. Goal Your goal is to create a unique string for each user, based on their known characteristics. Assume that the characteristics' importance changes over time. Your solution should not be based on specific feature selection, but on a model that can adjust to such changes.`'",,"DMBI 2019 Datahack- Device fingerprinting, Imperva",inClass,"Determine if two different requests, possibly targeted to two different sites, originate from the same user.",f_{beta},"dmbi-2019-datahack--device-fingerprinting,-imperva" 892,"'`The AI Academy, in partnership with the DATA Group of USP São Carlos, is holding the DOJO ML São Carlos machine learning competition. The competition will let participants find solutions to a real predictive-maintenance problem from a company. To take part in this competition you must register by following the links from this site: Competition Registration Introduction A major problem faced by companies in heavy-asset sectors, such as manufacturing, is the significant cost associated with delays in the production process due to mechanical problems. Most of these companies are interested in predicting these problems in advance, so that they can prevent them before they occur, which will reduce the costly impact caused by downtime. The Challenge The business problem of this competition is about predicting problems caused by component failures. The question the company needs to answer is: What is the probability that a machine will fail in the near future due to the failure of a given component?
In this competition you and your team will use a machine's historical information, along with its usage data and data about the operator behind it, to predict the probability of a machine failing. The problem can be framed as a multi-class classification problem, and the challenge of this competition is to create a predictive model that learns from historical data collected from several machines. We provide a large set of available data, separated into several tables - it is up to each team to use it in the best way.`'",,DOJO ML - São Carlos,,,meanfscore,dojo-ml-so-carlos 893,"'`Problem Overview We propose to use Dreem 2 headband data to perform sleep stage scoring on 30-second epochs of biophysiological signals. Context Sleep plays a vital role in an individual's health and well-being. Sleep progresses in cycles that involve multiple sleep stages: wake, light sleep, deep sleep, REM sleep. Different sleep stages are associated with different physiological functions. Monitoring sleep stages is beneficial for diagnosing sleep disorders. The gold standard to monitor sleep stages relies on a polysomnography study conducted in a hospital or a sleep lab. Different physiological signals are recorded, such as electroencephalogram, electrocardiogram etc. Sleep stage scoring is then performed visually by an expert on epochs of 30 seconds of signal recording. The resulting graph is called a hypnogram. It provides a compact description of the night. Additionally, a lot of information on sleep stages and feature extraction is given in this pdf. The Dreem headband The Dreem headband allows doing polysomnography at home thanks to three kinds of sensors: electroencephalogram (EEG), pulse oximeter and accelerometer signals. Since the Dreem headband records a lot of nights every day, we spent time developing the most accurate automatic sleep staging algorithms. Data was labeled directly on our data by trained sleep experts. In this challenge we provide you with such labelled data.
The idea is to develop a sleep-staging algorithm able to differentiate between Wake, N1, N2, N3 and REM on windows of 30 seconds of raw data. The raw data includes 7 EEG channels in frontal and occipital positions, 1 pulse oximeter infrared channel, and 3 accelerometer channels (x, y and z). You can find more information about Dreem at dreem.com`'",,Dreem 2 Sleep Stage Classification Challenge,inClass,Convert epochs of 30 seconds of Dreem signal into sleep stages,meanfscorebeta,dreem-2-sleep-stage-classification-challenge 894,"'`Context Every third bite of food relies on pollination by bees. At the same time, this past winter honeybee hive losses have exceeded 60% in some states. How can we address this issue? How can we better understand our bees? And most importantly, how can we save them before it's too late? While many indications of hive strength and health are visible on the inside of the hive, frequent check-ups on the hive are time-consuming and disruptive to the bees' workflow and hive in general. By investigating the bees that leave the hive, we can gain a more complete understanding of the hive itself. For example, an unhealthy hive infected with varroa mites will have bees with deformed wings or mites on their backs. These characteristics can be observed without opening the hive. To protect against robber bees, we could track the ratio of pollen-carrying bees vs those without. A large influx of bees without pollen may be an indication of robber bees. This dataset aims to provide basic visual data to train machine learning models to classify bees in these categories, paving the way for more intelligent hive monitoring or beekeeping in general. 
Help Material We have uploaded an online lecture on image processing for Machine Learning under the Hackathon section of Google Classroom. Some more links for help: Image Processing Using Python OpenCV Resize Image in OpenCV Python os library tutorial`'",,DS-14 Hackathon,,,meanfscore,ds-14-hackathon 895,"'`Can you predict which water pumps are faulty? Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania. This predictive modeling challenge comes from DrivenData, an organization who helps non-profits by hosting data science competitions for social impact. The competition has open licensing: ""The data is available for use outside of DrivenData."" We are reusing the data on Kaggle's InClass platform so we can run a weeklong challenge just for our Lambda School DS1 cohort. The data comes from the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water. In their own words: Taarifa is an open source platform for the crowd sourced reporting and triaging of infrastructure related issues. Think of it as a bug tracker for the real world which helps to engage citizens with their local government. We are currently working on an Innovation Project in Tanzania, with various partners.`'",,DS1 Predictive Modeling Challenge,inClass,Can you predict which water pumps are faulty?,categorizationaccuracy,ds1-predictive-modeling-challenge 896,"'`Can you predict which water pumps are faulty? 
Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania. This predictive modeling challenge comes from DrivenData, an organization who helps non-profits by hosting data science competitions for social impact. The competition has open licensing: ""The data is available for use outside of DrivenData."" We are reusing the data on Kaggle's InClass platform so we can run a weeklong challenge just for your Lambda School Data Science cohort. The data comes from the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water. In their own words: Taarifa is an open source platform for the crowd sourced reporting and triaging of infrastructure related issues. Think of it as a bug tracker for the real world which helps to engage citizens with their local government. We are currently working on an Innovation Project in Tanzania, with various partners.`'",,DS16 Predictive Modeling Challenge,inClass,Can you predict which water pumps are faulty?,categorizationaccuracy,ds16-predictive-modeling-challenge 897,"'` DS2 3 Kaggle . (Bosch) product line data . . . . .`'",,DS2 Kaggle Competition,,,matthewscorrelationcoefficient,ds2-kaggle-competition 898,"'` in-class Kaggle competition CO . CO 16 , 10 (train.csv) 2 (test.csv) CO . . https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures 20 submission , MAE(Mean Absolute Error) .`'",,Samsung DS2 Competition - Gas sensor data,,,mae,samsung-ds2-competition-gas-sensor-data 899,"'`Can you predict which water pumps are faulty? 
Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania. This predictive modeling challenge comes from DrivenData, an organization who helps non-profits by hosting data science competitions for social impact. The competition has open licensing: ""The data is available for use outside of DrivenData."" We are reusing the data on Kaggle's InClass platform so we can run a weeklong challenge just for your Lambda School Data Science cohort. The data comes from the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water. In their own words: Taarifa is an open source platform for the crowd sourced reporting and triaging of infrastructure related issues. Think of it as a bug tracker for the real world which helps to engage citizens with their local government. We are currently working on an Innovation Project in Tanzania, with various partners.`'",,DS4 Predictive Modeling Challenge,inClass,Can you predict which water pumps are faulty?,categorizationaccuracy,ds4-predictive-modeling-challenge 900,"'`Chukwudi Supermarkets is a leading indigenous chain of supermarkets with headquarters in Oshodi, Lagos, Nigeria. Its success has been driven by strong entrepreneurial value and commitment to excellence in providing products to all segments of the population at value for money prices, as underscored by its slogan ""Cheap and Cheerful"". It offers over 1,500 products across 10 stores in different cities of Nigeria. Mr M.N. 
Chukwudi, the Chairman of the company, is exploring a strategic expansion into more cities in Nigeria, but he wants to understand what product gives a better margin at specific stores. For example, a tin of milk which sells for N100 in one of his supermarket branches may also be sold at N110 at another supermarket within Mr Chukwudi's chain of supermarkets. He therefore needs to understand what type of product, market clusters and supermarket type (location, age, size) will give more margin as he plans to expand to more cities in the country. You have been engaged as the new Retail Data Analyst to build a predictive model and find out the sales of each product at a particular supermarket. With your guided analysis, Mr Chukwudi will understand the key characteristics of products and supermarkets driving sales and be better informed on an optimal template for his planned expansion to other states in Nigeria. The data provided comprises transaction records of all the supermarkets at the product level. Please note that the data may have missing values, as some stores might not report all the data due to technical glitches as a result of NEPA/generator failure. Variable Description Item_Identifier: Unique product ID Product_Weight: Weight of the product Product FatContent: Level of fat in the product Product ShelfVisibility: The % of the total display area of all products in Mr Chukwudi supermarket allocated to the particular product Product _Type: The category to which the product belongs Product _Price: Retail Price of the product Supermarket_Identifier: Unique store ID SupermarketStartYear: The year in which store was opened Supermarket _Size: The size of the store in terms of total ground area covered SupermarketLocationType: The type of city in which the store is located Supermarket _Type: Description of the supermarket as a grocery store or some sort of supermarket ProductSupermarketIdentifier: Unique identifier of each product type per supermarket. 
Product_ Supermarket_Sales: Sales of the product in a particular store. This is the outcome variable to be predicted.`'",,DSN AI+ OAU: Challenge 2 (July),inClass,Predicting sales of specific product type at different supermarkets,rmse,dsn-ai+-oau:-challenge-2-(july) 901,'` (accuracy).`',,DUTh DEECE - Computer Vision 2019-20 - Homework 4,inClass,Homework 4 - Image Classification using Deep Learning,categorizationaccuracy,duth-deece-computer-vision-2019-20-homework-4 902,"'`Dataset: Fashion MNIST classification. We have our own new labels for it, but you can read about the original dataset online in many places. Asg3 can be done alone or in pairs. Data: An image dataset based on ""Fashion MNIST"". The input features will be the same, but the label you will be using will be new, something we have computed and created based on the data. Task: Explore the dataset, analyse its properties Perform classification of the new label Technologies to use: python scikitlearn keras and tensorflow Algorithms to use at minimum: PCA Classic ML methods such as Random Forests or kernel SVM. CNNs you will also be able to use any other algorithms or Deep Neural Network architectures that you want to try out (e.g. Resnet, Inception, Autoencoders) Can be done alone or in pairs.`'",,ECE 657A W20 - Asg3 - Part 1,,,categorizationaccuracy,ece-657a-w20-asg3-part-1 903,'`Only two person group is allowed. Create a team and invite the other group member.`',,657atest,inClass,"University of Waterloo, winter term in 2020",categorizationaccuracy,657atest 904,"'`Challenge Description In today's technology-driven world, recommender systems are socially and economically critical for ensuring that individuals can make appropriate choices surrounding the content they engage with on a daily basis. One application where this is especially true is movie content recommendation, where intelligent algorithms can help viewers find great titles from tens of thousands of options. 
With this context, EDSA is challenging you to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences. Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity.`'",,EDSA Movie Recommendation Challenge,,,rmse,edsa-movie-recommendation-challenge 905,"'`Knowledge graphs have become very critical resources to support many AI-related applications, such as graph analytics, Q&A systems, web search, etc. A knowledge graph is a multi-relational graph composed of entities as nodes and relations as different types of edges. An instance of an edge is a fact triplet (head entity, relation, tail entity) (denoted as (h, r, t)). This task, as the second project in the EE448 course, asks you to infer missing links in an observed academic knowledge graph. To avoid internal information leakage, all the entities are encoded into 8-digit hexadecimal numbers. By participating in this competition, you'll be helping to further your understanding of the current status of knowledge representation algorithms in solving link prediction problems for academic networks. Acknowledgements This competition is hosted by EE448 TAs as the second project, with dataset and evaluation metric provided by the Acemap group.`'",,EE448-2018: Link Prediction,inClass,Can you predict the missing links in an academic network?,map@{k},ee448-2018:-link-prediction 906,"'`Welcome Welcome to the first kaggle competition hosted by Emergent Leuven. Emergent is the pioneering student organization that highlights the technical, ethical and commercial aspects of data science problems. There's a shortage of professionals who master an advanced set of skills in Data Science, but an increasing demand for these skills. 
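A common starting point for the rating-prediction task described above is an item-mean baseline, which any content or collaborative-filtering model should then improve on. This is a sketch on made-up ratings, not the competition data:

```python
# Sketch: predict an unseen (user, movie) rating with an item-mean baseline,
# falling back to the global mean for movies with no ratings yet.
# The ratings below are made up for illustration.
from collections import defaultdict

ratings = [  # (user, movie, rating)
    ("u1", "m1", 4.0), ("u2", "m1", 5.0), ("u1", "m2", 2.0),
]

def item_means(data):
    """Average rating per movie."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, movie, r in data:
        sums[movie] += r
        counts[movie] += 1
    return {m: sums[m] / counts[m] for m in sums}

def predict(movie, data=ratings):
    means = item_means(data)
    if movie in means:
        return means[movie]
    # global mean fallback for movies nobody has rated
    return sum(r for _, _, r in data) / len(data)
```

Since the challenge is scored with RMSE, even this trivial baseline gives a useful reference score to beat.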
With Emergent Leuven, we solve this problem in 3 ways: (1) Develop skills through workshops and educational tracks. (2) Apply skills at challenges and consultancy tracks. (3) Transfer mastered skills by teaching or coaching other students. We do this in close collaboration with industry and academia to bring students, researchers and professionals closer together. The challenge The goal of this challenge is to estimate the average review score a movie has received based on a bunch of variables. There are all sorts of variables in the dataset and many of them still need some kind of cleaning. There are no restrictions on the type of model that you can use. Emergent Leuven wants to give everyone the opportunity to apply their data science knowledge on a data set. People from all skill levels are encouraged to participate. Check out this free datacamp trial if you think that you don't have all the knowledge required to successfully complete the challenge.`'",,Emergent Kaggle Competion,,,mse,emergent-kaggle-competion 907,"'`Welcome to the Emojify Challenge! In this project, you will solve a Sentiment Classification problem where phrases are classified according to the sentiment they contain.`'",,ML Project - Emojify,,,categorizationaccuracy,ml-project-emojify 908,"'` - . HR- . , , , , . , , - . , HR- . , , . HR-. * *: , : ( ) . , https://colab.research.google.com/drive/1hflO0FV-SyEn0kI8dgL4cLqy_WRmFkpR?usp=sharing`'",,Employee resignation,,,ap,employee-resignation 909,"'` Amazon Kindle Store. : F1.`'",, ,,,meanfscore,---- 910,"'`You are given data concerning features of houses/flats of a European city (train_data files). The aim of the challenge is to provide the best predictions of the prices of the houses/flats given in the test data (test_data files). First, join the competition (top right corner button of this window). We provide two notebooks, which are available below, to help you: ""Notebook with detailed explanations"" and ""Compact notebook"". 
The first notebook describes in detail each step of the algorithm, while the second notebook is a compact version of the first. Once you have clicked on one of them, you can edit it (Copy & Edit button at the top right corner) and work with a copy of this notebook. If you wish to do this challenge in a team of two students (at most), go to the team tab and, in the team name field, put the two surnames of the teammates: name1-name2. Beware: you should submit your solution file as described at the end of each notebook to know your ranking. You can put at most ten submissions per day. Your score is computed on a fraction of the test_data. At the end of the competition, your score will be updated on the rest of the test_data. If you want to upload your own solution file, you can do so via the ""Submit Predictions"" button at the top right corner of this page and provide a .csv file (comma separated values), meaning that the data have to be in the same column and separated by a comma. An example of such a solution file is provided in the data tab, called ""example_solution_file.csv"". Decimal numbers are written with the dot ""."" separator. Several micro-courses on Machine Learning and several examples of programming are available on Kaggle (just get lost on this site, you will find true treasures). Do not hesitate to be creative and curious about functions of the Python libraries (sklearn documentation) used in the given notebooks.`'",,First Machine Learning Algorithm,,,mae,first-machine-learning-algorithm 911,"'`Challenge description The goal of this competition is to classify payment defaults among the clients of a Taiwanese bank as accurately as possible. You are given two data sets. The ""train_set"" data set is used to train your model. It contains a first column ""ID"" identifying the client, then 23 explanatory variables, then a final column ""DEFAULT"" which constitutes the labels to predict. 
For more information, see the ""Data Description"" section. The ""test_set"" data set is the one on which you will make your predictions. Your predictions will be in CSV format with only two columns. The first column will be ""ID"", containing the ID of the client in question, and the second column will be ""DEFAULT"", containing your prediction (0 or 1) for that client. This prediction file is what you must submit. The file ""sample.csv"" is an example of a submission file.`'",,ESAIP - Data Mining 2020,inClass,Compétition de classification - ESAIP 2020 - CPI5,categorizationaccuracy,esaip-data-mining-2020 912,"'` ntuaha@gmail.com ntuaha-13240@email.esunbank.com.tw , , 20132848074920.172% 70%30%F1 score30%F1 score PCA(28)TimeAmountTime Amount train.csvClassbooleantest.csv exampleSubmission.csvTXKEYClass1 0TXKEYClass10`'",,2018 CRV,,,f_{beta},2018-crv 913,"'`AI Academy SubmissionKaggle Notebook (Python or R) csvNotebooksScriptNotebooksubmit NotebookPublic Notebook Rules EvaluationRMSLE `'",,Exam for Students20200527,,,rmsle,exam-for-students20200527 914,"'`Problem description A journal needs to catalog all its news articles into different categories. The objective of this competition is to develop the best deep learning model to predict the category of new news articles. The possible categories are: ambiente equilibrioesaude sobretudo educacao ciencia tec turismo empreendedorsocial comida`'",,FASAM - NLP Competition - Turma 3,inClass,Predict News Category,meanfscore,fasam-nlp-competition-turma-3 915,"'`Brick-and-mortar grocery stores are always in a delicate dance with purchasing and sales forecasting. Predict a little over, and grocers are stuck with overstocked, perishable goods. Guess a little under, and popular items quickly sell out, leaving money on the table and customers fuming. The problem becomes more complex as retailers add new locations with unique needs, new products, ever transitioning seasonal tastes, and unpredictable product marketing. 
Corporación Favorita, a large Ecuadorian-based grocery retailer, knows this all too well. They operate hundreds of supermarkets, with over 200,000 different products on their shelves. Corporación Favorita has challenged the Kaggle community to build a model that more accurately forecasts product sales. They currently rely on subjective forecasting methods with very little data to back them up and very little automation to execute plans. They're excited to see how machine learning could better ensure they please customers by having just enough of the right products at the right time.`'",,Corporacin Favorita Grocery Sales Forecasting,,,custom metric,corporacin-favorita-grocery-sales-forecasting 916,"'`In Feedzai's challenge Catch the Fraudster! you will have the chance to take the lead as a Data Scientist fighting fraud! On a daily basis at Feedzai, predictive models process millions of transactions for large banks and online merchants. These models are generated with historical data using the latest Machine Learning/Artificial Intelligence techniques. In this challenge you will have a dataset available with about 40 million instances, with some characteristics similar to the datasets processed by Feedzai(*). One of the most important such properties is the large asymmetry between the number of fraudulent transactions and the number of legitimate transactions, which makes the data highly unbalanced. This will challenge you to use strategies adapted to catch fraud. (*) For privacy reasons, the data for this competition was obtained by post-processing a publicly available dataset with several characteristics in common with the type of data that Feedzai processes daily. Instructions The dataset that we have prepared to illustrate Feedzai's use case has a rate of positive class cases (fraud) of about 16%. Though this is larger than typical fraud rates at Feedzai (which can be 1% or smaller), it will already allow you to explore some strategies adapted to imbalanced datasets. 
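One common strategy for this kind of fraud/legitimate imbalance is to undersample the majority (legitimate) class while keeping every fraud case. The sketch below uses synthetic labels and a fixed seed, not the competition data; keeping the original order matters when the rows carry time dependencies:

```python
# Sketch: undersample the majority (legitimate, label 0) class to shrink
# the dataset while keeping every fraud case (label 1), preserving the
# original (time) order of the rows. Synthetic data for illustration.
import random

def undersample(rows, labels, legit_keep_prob, seed=0):
    """Keep all positives; keep each negative with probability legit_keep_prob."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    kept_rows, kept_labels = [], []
    for row, y in zip(rows, labels):
        if y == 1 or rng.random() < legit_keep_prob:
            kept_rows.append(row)
            kept_labels.append(y)
    return kept_rows, kept_labels
```

Because order is preserved, a chronological train/validation split can still be taken on the sampled rows.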
Another important issue that you will have to deal with is that this data set is already a large one (about 32 million instances), so you will have to be careful with the following points: Due to the size of the dataset, you will need to use a sample instead of using the full dataset, so that you can iterate efficiently. Please also note that the dataset contains time dependencies, so you will have to be careful about how to split your dataset for train and validation of the model. We don't mind that you do an aggressive sample in order to fit your available computing resources, even if it hurts the model performance a bit. However, you will need to think very carefully about how you are going to make your sample. Remember, the features you decide to design likely impact the way you build the sample. Focus on feature engineering and data understanding/exploration and not so much on exploring all types of models available. In particular, focus on which type of features you can build to characterize user past behavior. If you are not sure what model to use, use random forests. How do you explore and visualize the data? What can you learn from the data? What features do you design and how do you evaluate those features? What hypotheses do you/your features consider? We also advise you to start by building a simple baseline model using the raw features that you can use directly to set up the pipeline (see also this example kernel). Then you can re-iterate backwards to make more sophisticated features. Finally, check out the competition rules and some resources to reach out to the organizers or other participants in Slack.`'",,Catch the Fraudster!,inClass,Use Machine Learning to find the best predictions on an imbalanced dataset,auc,catch-the-fraudster! 917,"'`Introduction The general description of the challenge would go here: which kinds of methods may be used and which function underlies them. 1. Introduction 2. Which methods are permitted? 3. 
Which Python libraries may be used? (perhaps as a help) `'",,FhWedel,inClass,FhWedelSeminar2019,mae,fhwedel 918,"'`As a credit company, it is important to know beforehand who is able to pay their loans and who is not. The goal of this puzzle is to build a statistical/machine learning model to figure out which clients are able to honor their debt.`'",,Desafio Fia,inClass,Machine Learning,auc,desafio-fia 919,"'`Submissions are evaluated on the log_loss between the predicted values and the ground truth. Submission File For each hash_img_id in the test set, you must predict the probability that the fiducial image is an error (a number between 0 and 1). The file should contain a header and have the following format: hash_img_id,is_error 534ea359731b00f4a040776029bbd487,0.6 96ab7f46a62c6831fdd44a4209087067,0.5 2e1801fffbb9546ca86501ac7e260f15,0.5 `'",,Fiducial Error Image Detection,inClass,How can we find dust in the Fiducial Image?,logloss,fiducial-error-image-detection 920,"'`Budi, who is bored at home, wants to transcode the videos he owns using his father's old computer. To compare the processing speed across the various videos, Budi divides the videos into 4 groups, according to the processing time each video requires. The problem is that, because the computer's hard disk broke, Budi lost part of the data he had collected. Can you help Budi predict the group of the remaining videos? The working time is 5 hours, from 09.00 WIB to 14.00 WIB. The maximum number of submissions allowed is 10. Good luck! Presentation files are collected at this link: https://jointsugm.typeform.com/to/oXlQRV`'",,Final Data Mining Joints 2020,inClass,InClass Final DM JOINTS 2020,categorizationaccuracy,final-data-mining-joints-2020 921,"'` 1. : : ( ). 4 ; 1 , . ( ). (.. , //) single- : LinearRegression, Ridge, Lasso, ElasticNet. 
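The header-plus-probability submission format described for the fiducial-image task above can be produced with the standard csv module. This is a sketch: the two probabilities are placeholder values, and the ids are taken from the sample rows in the description.

```python
# Sketch: write a submission with a header row followed by
# (hash_img_id, probability) rows, as described above.
# The probabilities are placeholders, not real predictions.
import csv
import io

predictions = {
    "534ea359731b00f4a040776029bbd487": 0.6,
    "96ab7f46a62c6831fdd44a4209087067": 0.5,
}

def write_submission(preds, out):
    """Write the header and one row per image id to a text stream."""
    writer = csv.writer(out)
    writer.writerow(["hash_img_id", "is_error"])
    for img_id, p in preds.items():
        writer.writerow([img_id, p])

buf = io.StringIO()
write_submission(predictions, buf)
```

In practice `out` would be an opened file rather than a `StringIO` buffer; the buffer just makes the sketch self-contained.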
, private kernel ; (@gorodec) , `'",,", . ML, hw1",,,mse,",-.-ml,-hw1" 922,"'` . http://archive.ics.uci.edu/ml/datasets/Forest+Fires`'",,SejongAI..fired area prediction,,,rmse,sejongai..fired-area-prediction 923,"'`We have mainly 2 sets: Train and Test. The Train data set has 6 inputs (P,Q,X,Y,Z,T) for predicting two target values (A,B). You are expected to: model - build your neural network; train - train your model using the train set; test - predict the A and B values for the given inputs in the test set (P,Q,X,Y,Z,T); submit the predictions - submit the predictions according to the given sample submission file in the correct format; check your score - you can see your score after the submission. You can do 4 submissions per day. Acknowledgements I thank Professor Murat Karakaya, Ph.D. for providing this dataset.`'",,First challenge: Regression,,,mae,first-challenge:-regression 924,'`This will be a practice regression project. You should use linear regression (and its variants) to get your submissions!`',,Regression Practice for DS Online PT-012120 Cohort,,,rmse,regression-practice-for-ds-online-pt-012120-cohort 925,"'`Like last year's Higgs Boson Machine Learning Challenge, this competition deals with the physics at the Large Hadron Collider (LHC). However, the subject of last year's challenge, the Higgs Boson, was already known to exist. The aim of this year's challenge is to find a phenomenon that is not already known to exist, charged lepton flavour violation, thereby helping to establish ""new physics"". Flavours of Physics 101 The laws of nature ensure that some physical quantities, such as energy or momentum, are conserved. From Noether's theorem, we know that each conservation law is associated with a fundamental symmetry. For example, conservation of energy is due to the time-invariance (the outcome of an experiment would be the same today or tomorrow) of physical systems. 
The fact that physical systems behave the same, regardless of where they are located or how they are oriented, gives rise to the conservation of linear and angular momentum. Symmetries are also crucial to the structure of the Standard Model of particle physics, our present theory of interactions at microscopic scales. Some are built into the model, while others appear accidentally from it. In the Standard Model, lepton flavour, the number of electrons and electron-neutrinos, muons and muon-neutrinos, and tau and tau-neutrinos, is one such conserved quantity. Interestingly, in many proposed extensions to the Standard Model, this symmetry doesn't exist, implying decays that do not conserve lepton flavour are possible. One decay searched for at the LHC is τ → 3μ. Observation of this decay would be a clear indication of the violation of lepton flavour and a sign of long-sought new physics. Competition Design You will be working with real data from the LHCb experiment at the LHC, mixed with simulated datasets of the decay. The metric used in this challenge includes checks that physicists do in their analysis to make sure the results are unbiased. These checks have been built into the competition design to help ensure that the results will be useful for physicists in future studies. To get started, review the Data Page, and be sure to download the Starter Kit. The Starter Kit will help you to get used to the unique submission procedure for this competition. Competition Video Tutorial You've got lots of questions. Researchers at CERN & LHCb have the answers. - What is the goal of this competition? (1:56) - Why is finding τ → 3μ exciting? (2:18) - What are flavours? (4:10) - Why use machine learning to find τ → 3μ? (4:57) - How did you decide on the size of the dataset? (5:31) - Why is weighted AUC the evaluation metric? (6:09) - Why use Ds data for the Agreement Test? (7:53) - Why do we need a Correlation Check? (8:44) - How will the competition results impact what you do? 
(11:38) - How will the competition results be used at CERN? (12:17) Resources Flavour of Physics, Research Documentation Roel Aaij et al., Search for the lepton flavour violating decay τ− → μ−μ+μ−, JHEP 1502:121, 2015 New approaches for boosting to uniformity Acknowledgements This competition is brought to you by: Co-sponsored by: Additional support from: `'",,Flavours of Physics: Finding ,,,custom metric,flavours-of-physics:-finding--- 926,"'`The European Organization for Nuclear Research is the world's largest high energy physics laboratory. LHCb is an experiment set up to explore what happened after the Big Bang that allowed matter to survive and build the Universe we inhabit today. The Yandex School of Data Analysis (YSDA) is a free Master's-level program in Computer Science and Data Analysis, which has been offered by Yandex since 2007. The aim of the School is to train specialists in data analysis and information retrieval to be able to solve cutting edge industry problems as well as fundamental research challenges. YSDA has been an associated member of LHCb since December 2014. Yandex Data Factory are the Machine Learning and data analytics experts that use data science to improve business operations, revenues and profitability. By building upon the real-time personalisation and predictive analytics technology of parent company, Yandex, the fourth largest search engine in the world, Yandex Data Factory helps clients improve their business awareness through the exploitation of their own data. Yandex Data Factory's proven data science and technology continually analyses, tests, refines and reapplies hundreds of hypotheses to the customers' datasets to determine the best next course of action. It offers tailored, scalable, SaaS-driven Machine Learning services to a wide variety of data-reliant verticals, such as retail, financial services, travel and telecoms, who wish to use their data for purposes such as improving personalisation, segmentation, churn prevention or fraud detection. 
Yandex Data Factory was founded in 2014 by Yandex and is headquartered in Amsterdam, operating throughout Europe. Intel (NASDAQ: INTC) is a world leader in computing innovation. The company designs and builds the essential technologies that serve as the foundation for the world's computing devices. As a leader in corporate responsibility and sustainability, Intel also manufactures the world's first commercially available conflict-free microprocessors. Additional information about Intel is available at http://newsroom.intel.com and http://blogs.intel.com. The University of Zurich is one of the leading research universities in Europe and offers the widest range of degree programs in Switzerland. It was founded in 1833 and currently has seven faculties: Philosophy, Human Medicine, Economic Sciences, Law, Mathematics and Natural Sciences, Theology and Veterinary Medicine. Warwick is one of the UK's leading universities, with an acknowledged reputation for excellence in research and teaching, for innovation, and for links with business and industry. Institute of Nuclear Physics, Polish Academy of Sciences. Founded in 1955, the Institute of Nuclear Physics has become a leading particle physics research institution, ranked as class A+ by the Polish Ministry of Higher Education. Consistently ranked as one of Russia's top universities, the Higher School of Economics is a leader in Russian education and one of the preeminent economics and social sciences universities in eastern Europe and Eurasia. `'",,Flavours of Physics: Finding (Kernels Only),,,custom metric,flavours-of-physics:-finding----(kernels-only) 927,'`Predict whether a flight will be delayed for more than 15 minutes.`',,mlcourse.ai: Flight delays,,,auc,mlcourse.ai:-flight-delays 928,"'`Business Objective Each Salesperson will be assigned a target (no. of products to be sold) for each month, which is given to them based on their last month's achievement (no. of products sold), i.e. 10% more than the last achievement. 
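The current 10%-hike target rule described above is a fixed formula, which is exactly what the task's predictive model is meant to replace. A one-line sketch (how fractional targets are rounded is an assumption; the description does not say):

```python
def next_target(achievement: int) -> int:
    """Current business rule: next month's target is last month's
    achievement plus 10%. Rounding to the nearest whole product is
    an assumption, since the description does not specify it."""
    return round(achievement * 1.10)
```

A learned model would have to predict targets more sensibly than this flat 10% baseline to be worth deploying.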
Analysis to do We can implement a model to predict the target given to them for the next month based on their previous achievement, instead of simply giving them a 10% hike. Steps to be followed Business Objective Data set details EDA (Exploratory Data Analysis) Model Building Evaluation Deployment Acceptance criteria Entire EDA report based on the salespersons Different models with better performance and 95% accuracy Deployment should be done in RShiny or Flask and Heroku (optional)`'",,FMCG(Fast Moving Consumer Goods),inClass,To predict the achievement of sales in FMCG department,rmse,fmcg(fast-moving-consumer-goods) 929,"'` [17011766] [] kaggle Compete Forrest Gump(1994) Forrest Gump Compete kaggle_test kaggle_train submit_sample kaggle_test id Forrest Gump(1994) 9718 . kaggle train train test id Forrest Gump(1994) . submit_sample id result[Forrest Gump (1994)] id test id result[Forrest Gump (1994)] . https://www.kaggle.com/sengzhaotoo/movielens-small .`'",,SejongAI..[[Forrest Gump (1994)] ],,,rmse,sejongai..[[forrest-gump-(1994)]--] 930,"'`This competition is designed to ensure that you understand how to construct a robust classification model. You are given some data with a minimal description, and asked to use it to create a model for predicting a rare congenital disease. The best submission will receive a $100 voucher from GA. Project Score Your submissions are evaluated using ROC AUC score, and you will be required to submit the code for your model. The assignment scores will be determined using a combination of your leaderboard performance and the effort that you put into developing the model. A zero score is reserved for submissions that do not predict the disease correctly for any of the patients (ROC AUC score of 0.5). Acknowledgements We thank L. 
Yanhong for providing this dataset.`'",,predicting-life-threatening-condition,inClass,Developing a model for discovering rare life-threatening condition,auc,predicting-life-threatening-condition 931,"'`GAME-TEI 2020 Competition Created by Khoa Cao and Mark Gee Overview Kaggle is a competition platform for you to submit/examine your model's performance and gauge how you are doing compared to other teams. The task is to train a classifier to diagnose pneumonia using CXRs. To begin, use the starter code in the Jupyter notebook provided here. Dataset The dataset is a collection of CXRs which have been classified into NORMAL/PNEUMONIA. For this competition, we have modified the original dataset by moving some images from the training set to the validation set in order to reduce the variance of the originally small validation set. We also removed the classification of the test images, and re-ordered and re-named the test set images to prevent participants from directly training using the test set. Submission In order to view your test results, you will have to upload a .csv submission file. The .csv file can be produced using the code cell in Section 4 of the notebook. After submitting, you will be able to view your results on the leaderboard. As described in the notebook, the test results you can see on the public leaderboard will not be the final result. A separate, hidden test set will also be used to evaluate your results. This is a common practice to prevent teams from overfitting the test results. Submissions will be ranked based on AUROC. Acknowledgements Paul Mooney for the original dataset. Licensed: CC BY 4.0.`'",,GAME-TEI 2020,inClass,Medical AI Competition,auc,game-tei-2020 932,"'`Every artist dips his brush in his own soul, and paints his own nature into his pictures. -Henry Ward Beecher We recognize the works of artists through their unique style, such as color choices or brush strokes. 
The je ne sais quoi of artists like Claude Monet can now be imitated with algorithms thanks to generative adversarial networks (GANs). In this getting started competition, you will bring that style to your photos or recreate the style from scratch! Computer vision has advanced tremendously in recent years and GANs are now capable of mimicking objects in a very convincing way. But creating museum-worthy masterpieces is thought to be, well, more art than science. So can (data) science, in the form of GANs, trick classifiers into believing you've created a true Monet? That's the challenge you'll take on! The Challenge: A GAN consists of at least two neural networks: a generator model and a discriminator model. The generator is a neural network that creates the images. For our competition, you should generate images in the style of Monet. This generator is trained using a discriminator. The two models will work against each other, with the generator trying to trick the discriminator, and the discriminator trying to accurately classify the real vs. generated images. Your task is to build a GAN that generates 7,000 to 10,000 Monet-style images. Getting Started: Details on the dataset can be found here and an overview of the evaluation process can be found here. To learn how to submit and answers to other FAQs, review the Frequently Asked Questions. Recommended Tutorial We highly recommend Amy Jang's notebook that goes over the basics of loading data from TFRecords, using TPUs, and building a CycleGAN. Although the competition dataset only includes Monet images, check out this dataset for Cezanne, Ukiyo-e, and Van Gogh paintings to run your GAN on.`'",,Im Something of a Painter Myself,,,custom metric,im-something-of-a-painter-myself 933,"'`This is the final project for the GBM 2018-2019 course ""Machine learning and Brain-Computer Interfaces"". The objective is to classify MRI voxels into malignant and benign. 
Note that this dataset is private and should not be shared on the web. Acknowledgements We thank Carole Lartizien and Olivier Rouvière for providing this dataset.`'",,GBM 2018,inClass,Prostate cancer classification,auc,gbm-2018 934,"'`EMG (electromyography) is a clinical test to assess the working of muscles and the nerves that control them. The diagnosis is done by evaluating the electrical signals produced by skeletal muscles. The gestures, which are daily hand movements as listed below, are recorded using an electromyogram and analyzed. Spherical: for holding spherical tools. Tip: for holding small tools. Palmar: for grasping with the palm facing the object. Lateral: for holding thin, flat objects. Cylindrical: for holding cylindrical objects. Hook: for supporting a heavy load. Prizes worth up to 2k (rupees) to be won. This challenge is hosted by Protocol, 2019.`'",,Gesture Detection (Kernels Only),inClass,Detect the gesture made by the subject,mse,gesture-detection-(kernels-only) 935,"'`You are the Chief Data Officer of Rabobank, newly reinstated, and the Board of Directors has given you a new assignment: improving the credit decision process. The board has heard of new advancements in the field of machine learning and artificial intelligence, and wanting to also use these techniques, they have asked you to overhaul their old models and replace them with new ones. More specifically, they task you with developing something entirely new: a credit model. Instead of potential clients having to talk to an account manager face-to-face, new clients would be run through the model to see whether they should be given a loan or not. A client's financial information will be used to predict whether the client will default (the money that the bank loaned to the client is gone, resulting in a loss for the bank) or will not default (the client will keep paying interest payments, earning the bank profit). 
An algorithm can work all night without fatiguing, increasing the efficiency of the bank and leaving more time to help our clients grow.`'",,Rabo Modellathon,inClass,Create a credit decision algorithm in order to determine whether to accept clients by predicting whether or not they default,auc,rabo-modellathon 936,"'`2020.Spring.AI_TermProject 18011762 : , , : [_ ] https://data.kma.go.kr/climate/rainySeason/selectRainySeasonStdList.do?pgmNo=205 [_, ] https://data.kma.go.kr/data/grnd/selectAsosRltmList.do?pgmNo=36`'",,2020.Spring.AI.Termproject_ ,,,categorizationaccuracy,2020.spring.ai.termproject_- 937,"'`This is the home page of the competition. You don't need a subtitle here. The competition sub-title will appear above. This is where you introduce the problem. You can upload images using the ""select files"" widget on the left in the competition wizard. Upload an image, refresh the page, copy its URL, then insert within the wizard's editor. If you are copy-pasting from another application, like Word or your browser, try to make sure the html formatting is clean. You can view a page's html using the button at the top right of the editor's toolbar. This is a subtitle To format pages, stick to the following conventions: Paragraphs should go in p tags Code should go in pre tags Subtitles should go in h2 tags You can display equations using LaTeX enclosed in escaped brackets. For example, this: \[ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } \] is created by this: \[ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } \] Acknowledgements We thank Professor Plum, Ph.D. for providing this dataset.`'",,Raladores de Coco,inClass,Who is the best?,auc,raladores-de-coco 938,"'`BT4222 In-class Kaggle Competition We are hosting an in-class Kaggle competition to replace the midterm exam. 
The competition is a text categorization problem, i.e., labeling natural language texts with relevant categories from a predefined set [0, 1]. What will be the competition like? Binary classification of text Due on 04/04/2020 11:59 PM (UTC) Given a training dataset, make predictions on a test dataset You can make submissions in .csv format up to 3x per day Your top accuracy score will be ranked on a public leaderboard The public leaderboard is based on 40% of your submission At the end of the competition, you will be ranked on a private leaderboard The private leaderboard is based on the other 60% of your submission You may select 2 submissions for the final private leaderboard ranking Kaggle Tutorials How will you be graded? 28% of final grade 3 components: report, programming and ranking 1. Kaggle competition private leaderboard ranking (8%) You will be awarded 2% in this component if your submission outperforms the provided simple baseline. The other 6% will be awarded based on your relative ranking. 2. Programming (8%) You should submit a well-documented Jupyter notebook as well as a requirements.txt by 11th April 2020 23:59 Singapore time. Your code should be written in Python 3. All external dependencies should be clearly specified. 2% clear instructions on how to run the code 2% error-free code 4% documentation 3. Report (12%) You need to submit a short report by 11th April 2020 23:59 Singapore time. Name your report with your NUS Student Number, i.e. A0123456Z.pdf Your report must include the following 3 sections: 2.1 Data preprocessing, feature engineering, and how you came up with these approaches. 2.2 Model building and validation. 2.3 The models you explored, why they were chosen, and how you arrived at the final model for your top 2 submissions. There is a 4-page limit on substantive content comprising the above sections including tables and graphs, and any such content exceeding this limit will be ignored. 
You should use the normal layout of Microsoft Word, Calibri size 11, with single spacing. For section headings and the title, use the same font and size in bold. Finally, submit a Jupyter notebook (.ipynb) for the final model for your top submission, together with your report. Name this file with your NUS Student Number, i.e. A0123456Z.ipynb. The file should include your code for data preprocessing, feature engineering, prediction validation, and generating your .csv submission. Please also submit the requirements file with name: A0123456Z-requirements.txt The submission folder is LumiNUS > Kaggle Important Upon signing up for Kaggle and joining the competition, kindly change the Display Name in your Kaggle profile to your NUS student number, e.g. ""A0123456Z"". Do not sign up using more than one Kaggle account. This is a competition between individual participants. You should work on this problem on your own. Please read the Overview, Data, and Rules pages carefully. Please note with respect to the End Time of this competition (see Timeline) that it is specified in UTC.`'",,Answer Classification,inClass,The objective is to build a binary classification model to classify whether a given answer is a good answer or not. ,auc,answer-classification 939,"'`This is an introductory competition to test your understanding of the machine learning concepts learnt. For this competition, you will be applying your knowledge about algorithms and modelling to predict customer average spend. Use the customer data provided to create a regression model that predicts a customer's average monthly spend. Ensure you apply what you've learned in the resources shared.`'",,Technidus Machine Learning Competition 2nd Cohort,inClass,Making spend predictions,rmse,technidus-machine-learning-competition-2nd-cohort 940,'`aaa`',,testhrr,inClass,testhrr,categorizationaccuracy,testhrr 941,"'`https://www.kaggle.com/gyejr95/tft-match-data kaggle data raw dataset . 2020 6 10 . ( URL ver2 .) 
Challenger . TFT , . Raw dataset game duration, level, last round, combination, champion column , combination champion column . dictionary . train, test data raw data Ranked(Label) 8 shuffle label . 8 , 1~4 , 5~8 . submission.csv id result , (1~4) 1, (5~8) 0 .`'",,TFT (/),,,categorizationaccuracy,tft--(/) 942,'`The dataset is collected from various web resources in order to explore the used cars market and try to build a model that effectively predicts the price of the car based on its parameters (both numerical and categorical)`',,The Consulting Club || Analytics Challenge,inClass,Used Vehicle Catalog 2019 - Predict price of the used vehicles ,rmse,the-consulting-club-||-analytics-challenge 943,"'`The Scientist's Experiment: A scientist was performing experiments in his lab. He found that atmospheric conditions had something to do with the strength of a specific type of bacteria residing on coconut trees. He studied the strength of the bacteria in specific conditions of temperature and wind velocity; he recorded 4000 observations for the same. Your job is to help the scientist relate temperature and wind velocity to the bacteria's strength. In train.csv, 4000 recorded observations are given. 1st column represents the observation Id. 2nd column represents the temperature. 3rd column represents the wind velocity. 4th column represents the bacteria's strength - 0 means the bacteria has lost its strength & 1 means that the bacteria has not lost its strength. Develop a model relating temperature and wind velocity to the bacteria's strength. In test.csv, 2000 combinations of temperature and wind velocity are given. You need to predict whether the bacteria has lost its strength or not (i.e., 0 or 1). 1st column represents the observation Id. 
2nd column represents the bacteria's strength - i.e., 0 or 1 corresponding to the observation Id of test.csv.`'",,The Scientist's Experiment,inClass,An exciting prediction contest for ML enthusiasts,logloss,the-scientists-experiment 944,"'`In this competition, we want you to clean the given signal using your knowledge about signal filtering. However, you are free to use any operation on ""suphi_noisy.wav"" to clean it. A submission file should be a csv having 873694 lines. For example, for the noisy file we should have a csv containing SampleID,Left,Right 0,98,98 1,107,107 2,100,99 3,156,156 4,178,179 5,172,172 6,155,155 7,123,123 873689,-58,-61 873690,-35,-33 873691,-3,-4 873692,-7,-6 If you have any problems you can send me an e-mail. Res. Asst. Yusuf H. Sahin`'",,The Suphi Challenge,,,mse,the-suphi-challenge 945,"'`! "" "" ! : (). : seq2seq , (, BERT, GPT-2 ..) - . . ("""") . seq2seq ( LSTM) attention . , , BERT.`'",,Arxiv Title Generation,,,f_{beta},arxiv-title-generation 946,"'`Welcome to the valeuriad competition. Good luck!`'",,Toxic comments valeuriad,,,categorizationaccuracy,toxic-comments-valeuriad 947,"'`Use transfer learning to properly predict flowers. See the demo kernel for a starting point. Acknowledgements Dataset provided by Kaggle.`'",,TransferLearning Competition AppliedAI,inClass,Transfer Learning competition for Applied AI,categorizationaccuracy,transferlearning-competition-appliedai 948,"'`In this competition, you are given time-series historical data at the product level consisting of daily transactions, the number of product clicks, the amount of product stock, how many times the product was favourited, impression count, product price, product category and size, provided by the largest and fastest growing mobile commerce company in Turkey and the MENA region, trendyol.com. We are asking you to forecast the sales of each product for the next 7 days. 
This competition is a great chance to explore different models on real time series data and improve your skills in forecasting.`'",,Trendyol Project,inClass,Data Analytics Challenge - Trendyol Projesi,rmse,trendyol-project 949,"'`The travelling salesman problem (TSP) answers the following question: given a list of cities and the distances between each pair of them, what is the shortest possible route that visits each city exactly once and returns to the origin city at the end? This is an NP-hard problem in combinatorial optimization, very important in operations research and computer science. The objective of this competition is, starting from 76 data files, to obtain the optimal values for each of them. Each data file represents a different TSP problem, with different cities to visit. The evaluation method is the mean squared error with respect to the optimal value. The problem was first formulated in 1930 and is one of the most intensively studied optimization problems. It is used as a benchmark for many optimization methods. Although the problem is computationally hard, a large number of heuristics and exact methods are known, so that some instances from a hundred up to thousands of cities can be solved. The TSP has several applications even in its simplest formulation, such as planning, logistics and the manufacture of electronic circuits. Slightly modified, it appears as a sub-problem in many areas, such as DNA sequencing. In these applications, the concept of a city represents, for example, customers, soldering points or DNA fragments, and the concept of distance represents travel time or cost, or a measure of similarity between DNA fragments. 
In many applications, additional constraints such as resource limits or time windows make the problem considerably harder. The TSP is a special case of the travelling purchaser problem. In computational complexity theory, the decision version of the TSP (where, given a length L, the task is to decide whether the graph has a tour shorter than L) belongs to the class of NP-complete problems. Therefore, it is likely that in the worst case the running time of any algorithm that solves the TSP grows exponentially with the number of cities.`'",,TSP ACO,inClass,Algoritmos y Computabilidad,rmse,tsp-aco 950,"'`=================================================================================================== Human Activity Recognition Using Smartphones Dataset Version 1.0 Jorge L. Reyes-Ortiz(1,2), Davide Anguita(1), Alessandro Ghio(1), Luca Oneto(1) and Xavier Parra(2) 1 - Smartlab - Non-Linear Complex Systems Laboratory DITEN - Università degli Studi di Genova, Genoa (I-16145), Italy. 2 - CETpD - Technical Research Centre for Dependency Care and Autonomous Living Universitat Politècnica de Catalunya (BarcelonaTech). Vilanova i la Geltrú (08800), Spain activityrecognition '@' smartlab.ws The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data. 
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain. See 'features_info.txt' for more details. For each record it is provided: Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration. Triaxial angular velocity from the gyroscope. A 561-feature vector with time and frequency domain variables. Its activity label. An identifier of the subject who carried out the experiment. The dataset includes the following files: 'README.txt' 'features_info.txt': Shows information about the variables used on the feature vector. 'features.txt': List of all features. 'activity_labels.txt': Links the class labels with their activity name. 'train/X_train.txt': Training set. 'train/y_train.txt': Training labels. 'test/X_test.txt': Test set. 'test/y_test.txt': Test labels. The following files are available for the train and test data. Their descriptions are equivalent. 'train/subject_train.txt': Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30. 'train/Inertial Signals/total_acc_x_train.txt': The acceleration signal from the smartphone accelerometer X axis in standard gravity units 'g'. Every row shows a 128 element vector. The same description applies for the 'total_acc_y_train.txt' and 'total_acc_z_train.txt' files for the Y and Z axis. 
'train/Inertial Signals/body_acc_x_train.txt': The body acceleration signal obtained by subtracting the gravity from the total acceleration. 'train/Inertial Signals/body_gyro_x_train.txt': The angular velocity vector measured by the gyroscope for each window sample. The units are radians/second. Notes: Features are normalized and bounded within [-1,1]. Each feature vector is a row on the text file. The units used for the accelerations (total and body) are 'g's (gravity of earth -> 9.80665 m/seg2). The gyroscope units are rad/seg. A video of the experiment including an example of the 6 recorded activities with one of the participants can be seen in the following link: http://www.youtube.com/watch?v=XOEN9W05_4A For more information about this dataset please contact: activityrecognition '@' smartlab.ws License: Use of this dataset in publications must be acknowledged by referencing the following publication [1] [1] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013. This dataset is distributed AS-IS and no responsibility implied or explicit can be addressed to the authors or their institutions for its use or misuse. Any commercial use is prohibited. Other Related Publications: [2] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, Jorge L. Reyes-Ortiz. Energy Efficient Smartphone-Based Activity Recognition using Fixed-Point Arithmetic. Journal of Universal Computer Science. Special Issue in Ambient Assisted Living: Home Care. Volume 19, Issue 9. May 2013 [3] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. 
4th International Workshop of Ambient Assisted Living, IWAAL 2012, Vitoria-Gasteiz, Spain, December 3-5, 2012. Proceedings. Lecture Notes in Computer Science 2012, pp 216-223. [4] Jorge Luis Reyes-Ortiz, Alessandro Ghio, Xavier Parra-Llanas, Davide Anguita, Joan Cabestany, Andreu Català. Human Activity and Motion Disorder Recognition: Towards Smarter Interactive Cognitive Environments. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013. ================================================================================================== Jorge L. Reyes-Ortiz, Alessandro Ghio, Luca Oneto, Davide Anguita and Xavier Parra. November 2013.`'",tabular data,UCI-HAR,inClass,UCI-HAR,categorizationaccuracy,uci-har 951,"'` Yahoo!45Positive12Negative NotebooksDiscussion`'",text data,Japanese Review Rating Prediction,inClass,日本語レビューデータを予測しよう,auc,japanese-review-rating-prediction 952,"'`1. Title: Car Evaluation Database 2. Sources: (a) Creator: Marko Bohanec (b) Donors: Marko Bohanec (marko.bohanec@ijs.si); Blaz Zupan (blaz.zupan@ijs.si) (c) Date: June, 1997 3. Past Usage: The hierarchical decision model, from which this dataset is derived, was first presented in M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for multi-attribute decision making in 8th Intl Workshop on Expert Systems and their Applications, Avignon, France. pages 59-78, 1988. Within machine-learning, this dataset was used for the evaluation of HINT (Hierarchy INduction Tool), which was proved to be able to completely reconstruct the original hierarchical model. This, together with a comparison with C4.5, is presented in B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by function decomposition. ICML-97, Nashville, TN. 1997 (to appear) 4. 
Relevant Information Paragraph: The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates cars according to the following concept structure: CAR car acceptability . PRICE overall price . . buying buying price . . maint price of the maintenance . TECH technical characteristics . . COMFORT comfort . . . doors number of doors . . . persons capacity in terms of persons to carry . . . lug_boot the size of luggage boot . . safety estimated safety of the car Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, COMFORT. Every concept is in the original model related to its lower level descendants by a set of examples (for these example sets see http://www-ai.ijs.si/BlazZupan/car.html). The Car Evaluation Database contains examples with the structural information removed, i.e., it directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety. Because of the known underlying concept structure, this database may be particularly useful for testing constructive induction and structure discovery methods. Acknowledgement @misc{Lichman:2013 , author = ""M. Lichman"", year = ""2013"", title = ""{UCI} Machine Learning Repository"", url = ""http://archive.ics.uci.edu/ml"", institution = ""University of California, Irvine, School of Information and Computer Sciences"" }`'",tabular data,UMUC DATA 650 Summer 2019 Competition,inClass,Evaluate the Cars!,categorizationaccuracy,umuc-data-650-summer-2019-competition 953,"'`The data come from the IGM, and we will predict whether a municipality's score on the ENEM mathematics exam is above or below the Brazilian median. 
Grades will be assigned according to the following criteria: Highest score in the competition Best exploratory analysis kernel in the competition Best feature-importances analysis kernel Schedule: 21/05 - Class 06 - Building the Kaggle model 28/05 - Class 07 - Presentation of the models`'",tabular data,IESB - 2019,,,rmse,iesb-2019 954,"'`Introduction In this homework assignment you will work on a written Language Identification (LI) task, using a database created from Wikipedia paragraphs. Task description Fork the character-based RNN Baseline and create a new notebook with an additional contribution to the analysis, optimization or comparative study of the proposed model and task. Include your comments, results, tables, graphs and conclusions in the notebook. Assignment suggestions Hyperparameter optimization: study the performance of the model as a function of one of its parameters: the embedding size, RNN hidden size, number of layers, batch size, optimizer, learning rate or other optimizer parameters, number of epochs, Other input features: modify the code to use other features as input such as words, character n-grams, character counts, Comparative analysis with other DNN architectures: convolutional neural networks, average-pooling, Description and comparative analysis with other classical LI systems. You can use an existing implementation, but you will have to describe it in your own words in the notebook.`'",text data,Language Identification,inClass,Written language identification of Wikipedia paragraphs,categorizationaccuracy,language-identification 955,"'`Wine (from Latin vinum) is an alcoholic beverage made from grapes, generally Vitis vinifera, fermented without the addition of sugars, acids, enzymes, water, or other nutrients. Wine has been produced for thousands of years. The earliest known traces of wine are from Georgia (cca. 6000 BC), Iran (cca. 5000 BC), and Sicily (cca. 
4000 BC) although there is evidence of a similar alcoholic beverage being consumed earlier in China (cca. 7000 BC). The earliest known winery is the 6,100-year-old Areni-1 winery in Armenia. Wine reached the Balkans by 4500 BC and was consumed and celebrated in ancient Greece, Thrace and Rome. Throughout history, wine has been consumed for its intoxicating effects. Wine has long played an important role in religion. Red wine was associated with blood by the ancient Egyptians and was used by both the Greek cult of Dionysus and the Romans in their Bacchanalia; Judaism also incorporates it in the Kiddush and Christianity in the Eucharist. Yeast consumes the sugar in the grapes and converts it to ethanol and carbon dioxide. Different varieties of grapes and strains of yeasts produce different styles of wine. These variations result from the complex interactions between the biochemical development of the grape, the reactions involved in fermentation, the terroir, and the production process. Many countries enact legal appellations intended to define styles and qualities of wine. These typically restrict the geographical origin and permitted varieties of grapes, as well as other aspects of wine production. Wines not made from grapes include rice wine and fruit wines such as plum, cherry, pomegranate and elderberry. This work addresses the following issues concerning the quality of wine with respect to its various chemical contents and acids. The first issue concerns the correlation between different acids and the quality of wine. In this work, we investigate which parameter is most strongly associated with the best-quality wine. Overview of the Study Our field study concerns the quality of wine produced all over the world. Wine belongs to the alcohol family and is also considered part of a rich culture. There are various health benefits of wine (http://www.wideopeneats.com/10-health-benefits-get-drinking-daily-glass-wine/). 
There are a large number of occupations and professions that are part of the wine industry, ranging from the individuals who grow the grapes, prepare the wine, bottle it, sell it, assess it, market it and finally make recommendations to clients and serve the wine. In this study, we identify the important chemical components of wine that correlate with quality, and the acids associated with wine quality. We will find the right components, needed in the right ratio, to make a quality wine.`'",tabular data,MLDM Classification Competition,inClass,"The dataset is related to red and white variants of the Portuguese ""Vinho Verde"" wine. ",meanfscore,mldm-classification-competition 956,"'`In this semester project you will localize and classify street signs. The following classes are distinguished: speed limit 30 (prohibitory) speed limit 50 (prohibitory) speed limit 60 (prohibitory) speed limit 70 (prohibitory) speed limit 80 (prohibitory) restriction ends 80 (other) speed limit 100 (prohibitory) speed limit 120 (prohibitory) no overtaking (prohibitory) no overtaking (trucks) (prohibitory) priority at next intersection (danger) priority road (other) give way (other) stop (other) no traffic both ways (prohibitory) no entry (other) danger (danger) bend right (danger) uneven road (danger) slippery road (danger) construction (danger) traffic signal (danger) school crossing (danger) snow (danger) go right (mandatory) go left (mandatory) go straight (mandatory) go right or go straight (mandatory) keep right (mandatory) roundabout (mandatory) restriction end (overtaking (trucks)) (other) The coordinate centroids are given as X and Y pixel coordinates, and each class as an integer value.
In your submission file, the X and Y values must be given as float values between 0 and 1, and the classes as categorical (one-hot encoding).`'",image data,University of applied sciences Mannheim,inClass,"In this competition, students will detect German street signs",ae,university-of-applied-sciences-mannheim 957,"'`This voluntary semester assignment serves to demonstrate the knowledge you have acquired. By completing the task, alone or in groups of at most 3 people, you can earn a bonus for the exam at the end of the semester. At the end of the lecture period, a date will be set on which you will briefly present your solution, and the best solution will be awarded a prize. The semester assignment concerns the classification of the given dataset. The dataset consists of different data types: categorical, continuous, missing values, and more. Your task is to design a model that assigns the data to the corresponding classes. You may use the complete dataset for training and testing. When you present your algorithm, we will test your solution on new, unseen data. The results of the groups will be compared using precision, recall and the F1 measure.`'",tabular data,IES Data Mining(WS 18/19),inClass,Student Competition,meanfscore,ies-data-mining(ws-18/19) 958,"'`Below is the path on .7 to access the hackathon data. /data/hackathon_drive/ The data set consists of two classes: Human & Non-human. The train set consists of 750 images and the test set consists of approximately 250. The leaderboard metric used is Accuracy - (TP+TN)/(TP+TN+FP+FN). resultscript.pyc can be used to find your algorithm's performance. o Command: python resultscript.pyc {pathtoyoursubmissionfile} o The submission file MUST be in the same format as the sample_submission.csv.
The public score is calculated on 50% of the test data, and final standings will be based on the other 50%. 20 submissions are possible per day.`'",image data,JAMP Hackathon Drive 1,inClass,Computer Vision - Binary Classification,categorizationaccuracy,jamp-hackathon-drive-1 959,"'`Airbnb(price) RMSE AirbnbKaggle1CPU(1Core)+1GPU Private Leaderboard submit20`'",tabular data,HEROZ Internal Competition,inClass,HEROZ社内コンペティション,rmse,heroz-internal-competition 961,"'`Update: this competition has been cancelled on account of the COVID-19 pandemic. As a result of the continued collaboration between Google Cloud and the NCAA, the seventh annual Kaggle-backed March Madness competition is underway! Another year, another chance to anticipate the upsets, call the probabilities, and put your bracketology skills to the leaderboard test. Kagglers will join the millions of fans who attempt to forecast the outcomes of March Madness during this year's NCAA Division I Men's and Women's Basketball Championships. But unlike most fans, you will pick your bracket using a combination of the NCAA's historical data and your computing power, while the ground truth unfolds on national television. In the first stage of the competition, Kagglers will rely on results of past tournaments to build and test models. We encourage you to post any useful external data as a dataset. In the second stage, competitors will forecast outcomes of all possible matchups in the 2020 NCAA Division I Men's and Women's Basketball Championships. You don't need to participate in the first stage to enter the second. The first stage exists to incentivize model building and provide a means to score predictions. The real competition is forecasting the 2020 results. As the official public cloud provider of the NCAA, Google is proud to provide a competition to help participants strengthen their knowledge of basketball, statistics, data modeling, and cloud technology.
As part of its journey to the cloud, the NCAA has migrated 80+ years of historical and play-by-play data, from 90 championships and 24 sports, to Google Cloud Platform (GCP). The NCAA has tapped into decades of historical basketball data using BigQuery, Cloud Spanner, Datalab, Cloud Machine Learning and Cloud Dataflow, to power the analysis of team and player performance. The mission of the NCAA has long been about serving the needs of schools, their teams and students. Google Cloud is proud to support that mission by helping the NCAA use data and machine learning to better engage with its millions of fans, 500,000 student-athletes, and more than 19,000 teams. Game on! This page is for the NCAA Division I Men's tournament. Check out the NCAA Division I Women's tournament here. If you want to extend your analysis then try out our Analytics Competition here`'",tabular data,Google Cloud & NCAA ML Competition 2020-NCAAM,,,logloss,google-cloud-&-ncaa-ml-competition-2020-ncaam 962,"'`Update: this competition has been cancelled on account of the COVID-19 pandemic. As a result of the continued collaboration between Google Cloud and the NCAA, the seventh annual Kaggle-backed March Madness competition is underway! Another year, another chance to anticipate the upsets, call the probabilities, and put your bracketology skills to the leaderboard test. Kagglers will join the millions of fans who attempt to forecast the outcomes of March Madness during this year's NCAA Division I Men's and Women's Basketball Championships. But unlike most fans, you will pick your bracket using a combination of the NCAA's historical data and your computing power, while the ground truth unfolds on national television. In the first stage of the competition, Kagglers will rely on results of past tournaments to build and test models. We encourage you to post any useful external data as a dataset.
In the second stage, competitors will forecast outcomes of all possible matchups in the 2020 NCAA Division I Men's and Women's Basketball Championships. You don't need to participate in the first stage to enter the second. The first stage exists to incentivize model building and provide a means to score predictions. The real competition is forecasting the 2020 results. As the official public cloud provider of the NCAA, Google Cloud is proud to provide a competition to help participants strengthen their knowledge of basketball, statistics, data modeling, and cloud technology. As part of its journey to the cloud, the NCAA has migrated 80+ years of historical and play-by-play data, from 90 championships and 24 sports, to Google Cloud Platform (GCP). The NCAA has tapped into decades of historical basketball data using BigQuery, Cloud Spanner, Datalab, Cloud Machine Learning and Cloud Dataflow, to power the analysis of team and player performance. The mission of the NCAA has long been about serving the needs of schools, their teams and students. Google Cloud is proud to support that mission by helping the NCAA use data and machine learning to better engage with its millions of fans, 500,000 student-athletes and more than 19,000 teams. Game on! This page is for the NCAA Division I Women's tournament. Check out the NCAA Division I Men's tournament here.
If you want to extend your analysis then try out our Analytics Competition here`'",tabular data,Google Cloud & NCAA ML Competition 2020-NCAAW,,,logloss,google-cloud-&-ncaa-ml-competition-2020-ncaaw 963,"'`Airbnb(price) RMSE AirbnbKaggle1CPU(1Core)+1GPU Private Leaderboard submit20`'",tabular data,HEROZ Internal Competition Extra2,inClass,HEROZ社内コンペ延長線(fixed),rmsle,heroz-internal-competition-extra2 964,"'`Feature EngineeringBuild Models SubmissionNotebooks csvKernelnotebooksubmit KernelPublic Kernel Forum Rules`'",tabular data,Homework for Students,inClass,Happy modeling!,auc,homework-for-students 965,"'`Overview This is the in-class final project of Math 10 at UC Irvine for the Winter 2020 quarter. We will use the techniques learned in class to classify the hiragana characters in Japanese. You may use any of the techniques mentioned on the project page on Canvas. Data The dataset we will use is Kuzushiji-MNIST, which is similar to the handwritten digits dataset MNIST. Although the evaluation uses the test data, the set has been permuted randomly, so a direct copy of the testing label file will not work. Please follow the starter code on the Canvas page/Github repo to import the data into a notebook, and generate a sample answer to see if you can get the system to grade your submission. Teams Each team will consist of at most 3 members; after the free team sign-up phase, team selection will be randomly generated, weighted by the midterm, and will be available via the Canvas group function in the People tab. Submission Please refer to Rules: you can upload .csv files generated by the starter code on the Canvas page/Github repo to this competition. The submission is then cross-validated using about 50% of the testing dataset. I uploaded a benchmark solution myself using a 5-nearest neighbor algorithm, which achieves about 90% accuracy on the testing dataset. Your solution should be at least on par with this benchmark.
Acknowledgements We would like to thank Tarin Clanuwat for allowing us to use this dataset.`'",image data,UC Irvine Math 10 Winter 2020,inClass,Can your algorithm read ancient Japanese literature?,categorizationaccuracy,uc-irvine-math-10-winter-2020 966,"'`The Challenge The competition is simple: use machine learning to create a model that predicts Global Horizontal Solar Irradiance (GHI) from a set of features. Please see the ""Rules"" tab before participating. Overview of How Kaggle's Competitions Work Join the Competition Read about the challenge description, accept the Competition Rules and gain access to the competition dataset. Get to Work Download the data, build models on it locally or on Kaggle Kernels (our no-setup, customizable Jupyter Notebooks environment with free GPUs) and generate a prediction file. Make a Submission Upload your prediction as a submission on Kaggle and receive an accuracy score. Check the Leaderboard See how your model ranks against other competitors on the leaderboard. Improve Your Score Check out the Kaggle discussion forum to find lots of tutorials and insights from other competitors. What Data Will I Use in This Competition? In this competition, you'll gain access to two similar datasets. One dataset is titled train.csv and the other is titled test.csv. Using the patterns you find in the train.csv data, predict the target variable, Global Horizontal Solar Irradiance, on the data found in test.csv. Check out the Data tab to explore the datasets even further. Once you feel you've created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other competitors. How to Submit your Prediction to Kaggle Once you're ready to make a submission and get on the leaderboard: Click on the Submit Predictions button. Upload a CSV file in the submission file format as highlighted in the Evaluation Page. You're able to make a maximum of 3 submissions a day.
Acknowledgement Primary source for the Dataset: Bangladesh - Solar Radiation Measurement Data. The campaign that has generated this data was commissioned by The World Bank with funding from the Energy Sector Management Assistance Program (ESMAP). The data is made freely available under The World Bank's open data policy. Site Name: BDFE2 (Feni) Equipment: Helioscale omega station (Tier 1) Host Institution: Char Darbesh Adarsha Gram Government Primary School Elevation (m): 5 Latitude (positive North decimal degrees): 22.80029 Longitude (positive East decimal degrees): 91.35819`'",tabular data,"IEEE PES BDC DataThon , Year-2020",inClass,What is the value of solar radiation?,rmse,"ieee-pes-bdc-datathon-,-year-2020" 967,"'`Purpose of this event The International Symposium on Semiconductor Manufacturing (ISSM) will hold the ISSM AI Technology Contest 2020 this year, applying AI to data from semiconductor manufacturing sites in order to advance artificial intelligence (AI) technology in the field of semiconductor manufacturing equipment. The purpose of this contest is to broaden the base of practical research and development using real data generated at semiconductor manufacturing equipment sites. This contest is for the development of a ""defective particle classification in SEM images"" algorithm, which is essential for improving the yield of semiconductor manufacturing. Participants in this contest will create a learning model that automatically classifies 4,000 particle SEM images used in actual semiconductor manufacturing into the specified categories. The First ISSM AI Solution Contest To Revolutionize Semiconductor Manufacturing. What skills do we want?
In addition to basic image recognition and classification methods, it is necessary to solve issues that occur during actual operation at semiconductor manufacturing sites, such as imbalanced data, unclear classes, and micro-defects. Award Based on the classification accuracy, the best prize and the technical prize are awarded. A certificate of commendation and a prize will be awarded to the winning teams. Award Ceremony The teams of outstanding participants will be honored at ISSM 2020 (held online on December 15-16, 2020).`'",image data,ISSM2020 AI Challenge,inClass,SEM image classification competition,categorizationaccuracy,issm2020-ai-challenge 968,"'`Topic: Image classification when no labels are given Dataset: MNIST (basic), CIFAR10 (advanced) Description: In visual recognition tasks, such as image classification, unsupervised learning exploits cheap unlabeled data and can help to solve these tasks more efficiently. The dataset of such tasks only includes features x. There is no label y. The basic idea of unsupervised learning is feature extraction and clustering. Students are required to apply or design feature-extraction methods to get appropriate representations from the provided images, and then use clustering methods to distinguish images according to the distance among representations. This project is evaluated based on the test accuracy. We encourage students to compare different methods.`'",,UCSC CSE142 Project 4: Basic,,,categorizationaccuracy,ucsc-cse142-project-4:-basic 969,"'`Project Overview In this project, you will apply supervised learning techniques and an analytical mind to data collected for the U.S. census to help CharityML (a fictitious charity organization) identify people most likely to donate to their cause. You will first explore the data to learn how the census data is recorded. Next, you will apply a series of transformations and preprocessing techniques to manipulate the data into a workable format.
You will then evaluate several supervised learners of your choice on the data, and consider which is best suited for the solution. Afterwards, you will optimize the model you've selected and present it as your solution to CharityML. Finally, you will explore the chosen model and its predictions under the hood, to see just how well it's performing when considering the data it's given. You can find a full description of the dataset available here. Project Highlights This project is designed to get you acquainted with the many supervised learning algorithms available in sklearn, and to also provide for a method of evaluating just how each model works and performs on a certain type of data. It is important in machine learning to understand exactly when and where a certain algorithm should be used, and when one should be avoided. Things you will learn by completing this project: How to identify when preprocessing is needed, and how to apply it. How to establish a benchmark for a solution to the problem. What each of several supervised learning algorithms accomplishes given a specific dataset. How to investigate whether a candidate solution model is adequate for the problem. Success Metrics For the optional competition portion of this project, you may choose to use your machine learning algorithm to predict on our test set. This will give you some additional insight into how Kaggle competitions work, and how to create your own profile on Kaggle! The success of your model will be determined based on your model's AUC, or area under the curve, associated with ROC curves. For more information on how this is calculated see the documentation here.`'",,Udacity ML Charity Competition,,,auc,udacity-ml-charity-competition 970,"'`Note: the final deadline for submitting the report (a commit to the repository) is 3 days after the end of the competition. Introduction Ornithological monitoring is carried out at a measurement station on the Baltic Sea. Recordings are saved in order to track the nocturnal migrations of birds.
The research material consists of hours of recordings, in which the positive samples (bird sounds) make up barely about 1% of a recording. There is therefore a need for automatic detection of bird calls, to save researchers hours of listening to noise. The proposed topic is inspired by a presentation at the GMUM seminar by Hanna Pamuła, MSc Eng., of the Department of Mechanics and Vibroacoustics at AGH, who provided the data for the task. With your permission, the most interesting solutions will be passed on to her for consideration in further research. Data The data are recordings in the form of WAV files, together with text files marking the moments at which a sound occurs in the recordings. A popular approach to audio processing is to analyze its spectrogram and use image-processing methods, which are better developed than methods working directly on audio. A description of the data files can be found in the Data tab. Code is also attached that loads the WAV files and creates the spectrograms. For preprocessing you can either use the attached code or create your own representation for the task. Note that the provided labels may have a minimal error of a few milliseconds in their time annotations. Dataset Problems Various weather conditions and interference. Crickets chirping at lower frequencies, preventing detection of the songbird's voice. Varying voice parameters depending on the distance of the voice from the microphone. A small number of reference voices. Approach to the Task Data: a significant part of the work should be devoted to the data; the recordings with labels are a very raw dataset compared to the datasets we have worked with so far; I recommend spending time on preprocessing with classical audio/image analysis methods (e.g. denoising, dimensionality reduction) and on sensible augmentation (e.g. random noise, frequency shifting).
Architecture: you may use any architecture covered in class and its derivatives, including various types of convolutions, convolutional networks modeled on ResNet-style architectures, and recurrent networks (LSTM or GRU); a creative approach to the input (e.g. combining the spectrogram with the raw audio data) may be worth considering. Grading The solution is graded based on the model's AUC score, about which you can read more in the Evaluation tab. A maximum of 20 points can be earned for the task, plus up to 3 bonus points. The final number of points will be calculated with the formula: \[ P = \begin{cases} 10\frac{W - B}{M - B} + 10 + D, &\quad\text{if } W \geq B\\ \max(10\frac{W - 0.5}{B - 0.5}, 0), &\quad \text{otherwise} \end{cases} \] where: W - the participant's AUC score, B - the baseline's AUC score, M - the highest AUC score in the competition, D - bonus points earned. A necessary condition for earning bonus points is writing a report on the solution. Bonus points can be earned for top places in the ranking, but also for interesting reports that present partial results that help improve the predictive model (even if the overall model is not the highest in the ranking). More information in the Report tab. In case of doubts, feel free to ask questions in the Discussion tab. You can also contact me by email: ***.`'",,UJ SN2019 Zadanie 2: Nocne Ptasie Wdrwki,,,auc,uj-sn2019-zadanie-2:-nocne-ptasie-wdrwki 971,"'`At Shopee, we strive to ensure fairness to both buyers and sellers, and improve user experience by identifying and discouraging negative behaviour. Listing quality is a major area where poor behaviours often occur. Every transaction on Shopee starts from a product listing. In order to get more sales, sellers may engage in certain behaviour to increase their listings' exposure and gain an unfair advantage over other shops. An example of such behaviour is keyword spam, whereby sellers input irrelevant keywords in the listing title that do not accurately describe the products they are selling.
For instance, the product title claims that the listing is for pants, shirt, shoes, while the item that is actually being sold is just a pair of pants. Sellers do this in the hope that when buyers search for shirt or shoes, their listings will also appear in the search results. This behaviour of spamming irrelevant keywords in the title may confuse the search engine and affect the accuracy of search results, and therefore result in a poor user experience. Therefore, it is important to identify, punish and deter such behaviour on Shopee. However, at the same time, we also need to consider the case where sellers input multiple product keywords in the listing title but those keywords are relevant to the products. An example is that the underlying product is a pair of shoes, and the seller describes it in the listing title as ""shoes, sneakers"". In this case, the seller is trying to increase their search exposure, but does not use a misleading product title, and therefore should not be penalized. While it is important to deter negative behaviour, it is also very important to avoid wrongly discouraging positive behaviour. Task: Using the keyword directory, identify the product groups that are present in the product title. Example: Keyword list: Group: 0, Keywords: jacket Group: 1, Keywords: windbreaker, raincoat Product title: Index: 0, Name: red jacket windbreaker Since the product title contains keywords from both groups 0 and 1 --> groups found: [0,1] Input 1.Extra Material 2 - keyword list_with substring.csv: List of product keywords, separated into product groups. Each row is a product group. The same keyword may appear in multiple groups (e.g. notebook). Some of the keywords are substrings of other keywords. In this case, the longer word should take priority over the substring. 2.Keyword_spam_question.csv: File containing the product names from which you need to extract the product keyword groups.
Further Details You will be given a directory of product keywords, organized into keyword groups. The .csv file provided will have 2 columns: Group: arbitrary index of the product keyword grouping Keywords: product keyword. Keywords on the same row denote words that can refer to the same product, and therefore should be considered the same product type (e.g. raincoat and windbreaker can refer to the same product) Keywords on different rows denote words that refer to different product types (e.g. shirt and raincoat refer to different product types) One keyword may appear in multiple groups (e.g. notebook could refer to a computing product or stationery) Note: you do not need to look into the correctness of the grouping, and should use it as-is. Using the keyword directory, you need to identify the product groups that are present in the product title. If 2 product groups are equally present in the result, choose the group with the smaller index. Eg 1: White netbook, ultrabook and gaming mousepad should contain product groups [77, 85], because keyword netbook is in group 77; keyword ultrabook is also in group 77; keyword gaming mousepad is in group 85. Eg 2: Beautiful red notebook shirt jeans should contain product groups [6, 29, 77], because keyword notebook is in groups 77 and 204; keyword shirt is in group 29; keyword jeans is in group 6. Since using group 77 or group 204 will both result in 3 product groups, we choose group 77 due to the smaller index. Eg 3: Printer toner wallpaper ink should contain product groups [81, 182], because keyword Printer toner is in group 81; keyword wallpaper is in group 182. Even though keyword Printer is in another group (79), it is a substring of Printer toner. Therefore 'Printer toner' takes priority over 'Printer'.
Submit Format A csv file (utf-8 encoding) containing 2 columns: index: index of the item in the Keyword_spam_question.csv file groups_found: list of groups that are found in the corresponding item title, sorted ascending. Group names should be according to the Extra Material 2 - keyword list_with substring.csv file. If 2 product groups are equally present in the result, choose the group with the smaller index. Example: index groups_found 0 [77] 1 [216, 217] 2 [216, 218, 221] Your submission should have 800000 rows, each with 2 columns. Tips: 1) You are advised to run your tests on a sample of the dataset first. 2) If you are unable to solve the entire problem within the time limit, create the output csv with the required number of columns and rows based on a subset of the problem first. Teams which do not make a successful submission for both rounds of the competition will not be considered for the overall ranking.`'",,[Undergraduate] I'm the Best Coder! Challenge 2019,,,categorizationaccuracy,[undergraduate]-im-the-best-coder!-challenge-2019 972,"'`Fraud Detection Fraudsters create fake transactions to boost sales/shop ratings. Fake transactions are defined as transactions where the buyer and seller are the same individual (in reality). To help Shopee tackle this issue, you are expected to distinguish these fake transactions from normal transactions. Sample data for transactions and users' details will be provided. Task Find fake orders where the buyer and the seller are directly or indirectly linked, by any of the following links: Device, Credit Card, Bank Account. Direct link: the buyer and the seller share the same details. Indirect link: the buyer and the seller are not directly linked, but users who share the same details as them share details with one another: e.g. buyer - user A - user B - user C - - user Z - seller Basic Concepts Each userid represents a distinct user on Shopee. Each orderid represents a distinct transaction on Shopee.
Device, Credit Card, and Bank Account data are encrypted to preserve data privacy. Each distinct value represents a unique entity. Examples Example 1 orderid: 1955598428, buyer userid: 35545436, seller userid: 70763052. The buyer has this device: ""/3TLpeou8xXsNxpACFFKr34Kqqwxiu5Hi1keJ6plk5E="". The seller also has this device: ""/3TLpeou8xXsNxpACFFKr34Kqqwxiu5Hi1keJ6plk5E="". Therefore, we consider that the buyer and the seller are directly linked by device. This order is a fraud order by definition. Example 2 orderid: 1953543830, buyer userid: 223406364, seller userid: 193350172. User 223406364 is directly linked to user 227839480 by sharing the same device ""7q1zwUrfP8+09Z+EPh+YyNYTwxhHW7wfGuIFWhRE490="". User 227839480 is directly linked to user 193350172 by sharing the same device ""IkGjfHwwIGYxZ4WkM30COPKkmALyJfSSODpNTTPuMyS="". Therefore, the buyer (userid: 223406364) is indirectly linked to the seller (userid: 193350172). This order is a fraud order by definition. Submit Format Two columns are required: orderid. is_fraud: assign value 1 if the order is fraud, otherwise 0. Example orderid is_fraud 1953277092 0 1952278092 1 Your submission should have 620947 rows, each with 2 columns. Tips: 1) You are advised to run your tests on a sample of the dataset first. 2) If you are unable to solve the entire problem within the time limit, create the output csv with the required number of columns and rows based on a subset of the problem first. Teams which do not make a successful submission for both rounds of the competition will not be considered for the overall ranking.`'",,[Undergrad] I'm the Best Coder! Challenge 2019,,,matthewscorrelationcoefficient,[undergrad]-im-the-best-coder!-challenge-2019 973,"'`Welcome to the In-class NLP competition for Deloitte USI. This is an in-house competition curated to give you hands-on experience with the natural language processing workflow.
Through the current competition, we would like you to apply the methods and techniques taught in class to the dataset provided. Throughout the course, we give you the opportunity to challenge and test the skills you pick up. This competition is for learning purposes, and prize money for winners has not been decided yet. Please note that this is a limited-participation competition. Only invited users from Deloitte USI may participate. In order to get a certificate for the NLP Practicum course, you must make at least one submission before the deadline. Don't cheat! Apply yourself! Have fun! If you attended the NLP Practicum course and are facing any problems in this competition, please contact Vikas Kumar (vikkumar@deloitte.com)`'",,USI NLP Practicum,inClass,InClass NLP Competition by ML Guild for Deloitte USI,categorizationaccuracy,usi-nlp-practicum 974,"'`Introduction In this homework assignment you will create, analyze and evaluate word vectors, using a database created from the Catalan Wikipedia. Task description You will have to modify the provided baseline notebooks to create a new notebook with an additional contribution to the analysis, optimization or comparative study of the baseline model and word vectors. Assignment 1 (word vectors) Improve the CBOW model with Position-dependent Weighting. The standard CBOW model (CBOW Training notebook) sums all the context word vectors with the same weight. Implement and evaluate a weighted sum of the context words with: a) A fixed scalar weight, e.g. (1,2,2,1), to give more weight to the words that are closer to the predicted central word b) A trained scalar weight for each position c) A trained vector weight for each position. Each word vector is element-wise multiplied by the corresponding position-dependent weight and then added to the rest of the weighted word vectors.
d) (Optional) Hyperparameter optimization: study the performance of the model as a function of one of its parameters: the embedding size, batch size, optimizer, learning rate/scheduler, number of epochs, sharing input/output embeddings. e) (Optional) Implement other methods to obtain word embeddings Evaluate the performance of the improved Catalan word vectors a) Implement the WordVector class (Word Vector Analysis notebook) with the most_similar and analogy methods to find the closest vectors and analogies using the cosine similarity measure. b) Intrinsic evaluation: perform an informal evaluation finding good and bad examples of closest words and analogies. You can analyze the behavior of the CBOW word vectors for words with multiple meanings, synonyms and antonyms, word frequency, different types of analogies, bias (gender, race, sexual orientation, etc.). c) Prediction accuracy: compare the accuracy of the implemented CBOW models on the out-of-domain (el Periódico) test set (prediction of the central word given a context of the 2 previous and 2 next words). To obtain your score on the competition test set you have to commit your version of the CBOW Training notebook, wait until the training completes, and then ""Submit to competition"" the obtained file (submission.csv) in the Output section of the notebook. d) (Optional) Visualize word analogies or word clustering properties Assignment 2 (language modeling) Improve the prediction accuracy of the baseline TransformerLayer notebook Suggestions: Increase the number of TransformerLayers (2 or more) Multilayer perceptron (MLP) over the concatenated input vectors TransformerLayer with multi-head attention Replace the last linear layer and the softmax part of the loss with AdaptiveSoftmax Sharing input/output embeddings Hyperparameter optimization: embedding size, batch size, pooling layer (mean, max, first, etc.), optimizer, learning rate/scheduler, number of epochs, etc.
The report should include the modified source code, a simple schematic drawing of the model, and your results and conclusions. Create a comparative table of the studied models with respect to the single-layer transformer baseline, including loss, accuracy, training time, number of parameters and hyperparameter differences, with at least 4 new models or hyperparameter settings.`'",,Word vectors,inClass,Catalan word vectors,categorizationaccuracy,word-vectors 975,"'`Welcome to the competition. This competition serves as the final group project for students in the Spring 2020 delivery of ISYS 4293 Business Intelligence, though others have been invited to keep the competition level up. Competition Description Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. All Submissions Subject to Verification Note that although this dataset is based on the Dean De Cock work, the dataset has been modified substantially for use in this competition. This has been done to encourage creativity, student curiosity and inventiveness rather than their ability to google solutions. Acknowledgements The Ames Housing dataset was compiled by Dean De Cock for use in data science education.
It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.`'",,WCOB Spring 2020,inClass,See if you can predict the sale price of houses,rmse,wcob-spring-2020 976,"'`Competition description The goal of this competition is the prediction of the price of diamonds based on their characteristics (weight, color, quality of cut, etc.).`'",,WhiteBox ML in-company training competition,inClass,Predict diamond prices using Machine Learning models,rmse,whitebox-ml-in-company-training-competition 977,"'`Google Cloud and NCAA have teamed up to bring you this year's version of the Kaggle machine learning competition. Another year, another chance to anticipate the upsets, call the probabilities, and put your bracketology skills to the leaderboard test. Kagglers will join the millions of fans who attempt to forecast the outcomes of March Madness during this year's NCAA Division I Men's and Women's Basketball Championships. But unlike most fans, you will pick your bracket using a combination of the NCAA's historical data and your computing power, while the ground truth unfolds on national television. In the first stage of the competition, Kagglers will rely on results of past tournaments to build and test models. We encourage you to post any useful external data as a dataset. In the second stage, competitors will forecast outcomes of all possible match-ups in the 2018 NCAA Division I Men's and Women's Basketball Championships. You don't need to participate in the first stage to enter the second. The first stage exists to incentivize model building and provide a means to score predictions. The real competition is forecasting the 2018 results. This page is for the NCAA Division I Women's tournament.
Check out the NCAA Division I Men's tournament here.`'",,Google Cloud & NCAA ML Competition 2018-Women's,,,logloss,google-cloud-&-ncaa-ml-competition-2018-womens 978,"'`As a result of the continued collaboration between Google Cloud and the NCAA, the sixth annual Kaggle-backed March Madness competition is underway! Another year, another chance to anticipate the upsets, call the probabilities, and put your bracketology skills to the leaderboard test. Kagglers will join the millions of fans who attempt to forecast the outcomes of March Madness during this year's NCAA Division I Men's and Women's Basketball Championships. But unlike most fans, you will pick your bracket using a combination of the NCAA's historical data and your computing power, while the ground truth unfolds on national television. In the first stage of the competition, Kagglers will rely on results of past tournaments to build and test models. We encourage you to post any useful external data as a dataset. In the second stage, competitors will forecast outcomes of all possible matchups in the 2019 NCAA Division I Men's and Women's Basketball Championships. You don't need to participate in the first stage to enter the second. The first stage exists to incentivize model building and provide a means to score predictions. The real competition is forecasting the 2019 results. As the official public cloud provider of the NCAA, Google Cloud is proud to provide a competition to help participants strengthen their knowledge of basketball, statistics, data modeling, and cloud technology. As part of its journey to the cloud, the NCAA has migrated 80+ years of historical and play-by-play data, from 90 championships and 24 sports, to Google Cloud Platform (GCP). The NCAA has tapped into decades of historical basketball data using BigQuery, Cloud Spanner, Datalab, Cloud Machine Learning and Cloud Dataflow, to power the analysis of team and player performance.
The mission of the NCAA has long been about serving the needs of schools, their teams and students. Google Cloud is proud to support that mission by helping the NCAA use data and machine learning to better engage with its millions of fans, 500,000 student-athletes and more than 19,000 teams. Game on! This page is for the NCAA Division I Women's tournament. Check out the NCAA Division I Men's tournament here.`'",,Google Cloud & NCAA ML Competition 2019-Women's,,,logloss,google-cloud-&-ncaa-ml-competition-2019-womens 979,"'`Natural resource managers responsible for developing ecosystem management strategies require basic descriptive information, including inventory data for forested lands, to support their decision-making processes. However, managers generally do not have this type of data for inholdings or neighboring lands that are outside their immediate jurisdiction. One method of obtaining this information is through the use of predictive models.`'",,Milestone 1-Linear Models,inClass,How accurately can you predict wildfire ignition points?,rmse,milestone-1-linear-models 980,'`Insert text here. Please use markdown.`',,Wy4Vnuumx7y294p,inClass,[200217] Pingpong AI Research - 신재민님,categorizationaccuracy,wy4vnuumx7y294p 981,"'`Problem statement In this competition, you will work with a dataset of credit card users from a Russian bank. One of the decisions that the bank must make when issuing a credit card is how large a limit to provide. The larger the limit, the higher the potential loss for the bank in case of default, but also the higher the potential profit if the client is good and makes a lot of transactions. So it is important to predict the probability of default and the credit card turnover to make the decision about the limit. This competition is focused on predicting the credit card turnover. More precisely, the target variable is the average turnover per month during the first year of credit card usage.
It should be predicted based on a few simple features known to the bank at the moment of application, such as the region, age and income of the potential client. Due to the limited number of input features, this problem is something of a toy example, but it offers good experience in working with numeric and categorical features and in checking the results of data analysis against intuition.`'",,Credit card turnover prediction,inClass,Use simple data from credit card applications to predict how much the clients would spend per month,rmsle,credit-card-turnover-prediction 982,"'`Zillow's Zestimate home valuation has shaken up the U.S. real estate industry since first released 11 years ago. A home is often the largest and most expensive purchase a person makes in his or her lifetime. Ensuring homeowners have a trusted way to monitor this asset is incredibly important. The Zestimate was created to give consumers as much information as possible about homes and the housing market, marking the first time consumers had access to this type of home value information at no cost. Zestimates are estimated home values based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. And, by continually improving the median margin of error (from 14% at the onset to 5% today), Zillow has since become established as one of the largest, most trusted marketplaces for real estate information in the U.S. and a leading example of impactful machine learning. Zillow Prize, a competition with a one million dollar grand prize, is challenging the data science community to help push the accuracy of the Zestimate even further. Winning algorithms stand to impact the home values of 110M homes across the U.S. In this million-dollar competition, participants will develop an algorithm that makes predictions about the future sale prices of homes.
The contest is structured into two rounds: the qualifying round, which opens May 24, 2017, and the private round for the top 100 qualifying teams, which opens on Feb 1st, 2018. In the qualifying round, you'll be building a model to improve the Zestimate residual error. In the final round, you'll build a home valuation algorithm from the ground up, using external data sources to help engineer new features that give your model an edge over the competition. Because real estate transaction data is public information, there will be a three-month sales tracking period after each competition round closes where your predictions will be evaluated against the actual sale prices of the homes. The final leaderboard won't be revealed until the close of the sales tracking period.`'",,Zillow Prize: Zillow's Home Value Prediction (Zestimate),,,custom metric,zillow-prize:-zillows-home-value-prediction-(zestimate) 983,"'`Objective Price House Price Awards Top 1 and 2: Starbucks coffee from Shawn, Chen Metrics RMSE https://www.statisticshowto.datasciencecentral.com/rmse/ Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.`'",,goagoagoa2,,,rmse,goagoagoa2 984,"'` , . . RMSLE. can_buy - , can_promote - , category - , contacts_visible - , date_created - , delivery_available - , description - , fields - , id - , images - id , location - , mortgage_available - , name - , payment_available - , price - , subcategory - , subway - ,`'",,Hackathon SF ML 3,inClass,Predicting the price of a listing,rmsle,hackathon-sf-ml-3 985,"'`Summary This competition focuses on the problem of forecasting future web traffic for Wikipedia articles.
The dataset consists of 45 time series representing an aggregated number of daily views for multiple Wikipedia articles, starting from July 1st, 2015 up until August 20th, 2017. The goal of this competition is to produce the best forecasts for the period from August 21st, 2017 up until September 10th, 2017. The full training set is available to you. You can evaluate your predictions by submitting them to the Kaggle website. Note that only 40% of the test set is used to compute your public score. Your final private score using the full test set will be provided at the end of the competition. Important: You are not allowed to use any other external information/data to build your model and produce your predictions. The evaluation metric for this competition is the symmetric mean absolute percentage error SMAPE: \[ \text{SMAPE}(y_t, \hat{y}_t) = \frac{100}{n_{\text{test}}} \sum_{t = 1}^{n_{\text{test}}} \frac{|y_t - \hat{y}_t|}{(|y_t| + |\hat{y}_t|)/2}, \] where \(n_{\text{test}}\) is the number of data points in the test set, \(\hat{y}_t\) is the forecast, and \(y_t\) is the true value. We define \(\text{SMAPE}(0, 0) = 0\). Submission Format For each series and forecast horizon, you must produce a value (see the Notebook). The file should contain a header and have the following format: Id, forecasts s1h1, 0 s1h2, 0 s1h3, 0 s1h4, 0 s1h5, 0 etc. (s1h3 means series 1, forecast horizon 3.) Tasks Each student should create a Kaggle account. Each group should form a team on Kaggle. Predict the test set, and upload your predictions to Kaggle. See the Notebook ""Starter"" (https://www.kaggle.com/bsouhaib/starter) Project report The data analysis report can be a maximum of 5 pages, and must abide by the section structure described below. Section 1: Introduction The introduction will describe the data set, for example using visualization, and motivate the problem. It should be brief. Section 2: Methodology This section describes the models and methods you have used, including a justification of your choices.
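As a reference for checking your own submissions locally, the SMAPE metric defined above (with SMAPE(0, 0) taken as 0) can be written in a few lines of NumPy. This is a minimal sketch, not the competition's official scoring code:

```python
import numpy as np

def smape(y_true, y_pred):
    # Symmetric mean absolute percentage error, in percent.
    # Terms where both values are 0 are defined as 0, per the rules above.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # np.divide with out/where avoids division by zero for (0, 0) pairs
    ratio = np.divide(np.abs(y_true - y_pred), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * ratio.mean()

print(smape([100, 200, 0], [110, 190, 0]))
```

A perfect forecast gives a score of 0, and the worst possible score is 200.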
You should also present your model fitting, diagnostics, etc. Specifically, you should discuss the forecast performance of at least one method from each of the following approaches: Statistical time series methods (e.g. ARMA, SARIMA, etc.) Regression-based methods (recursive or direct) with machine learning algorithms (e.g. linear, boosting, nearest neighbours, etc.) Deep Neural Network architectures In other words, you should discuss at least three methods in total. Section 3: Results and Discussion This includes for example graphs and tables, as well as a discussion of the results. Section 4: Conclusion This includes a summary of the findings. You should clearly explain what you have done, using figures to supplement your explanation. Your figures must be of proper size with labeled, readable axes. In general, you should take pride in making your report readable and clear. You will be graded on quality of content, presentation and code. Deadlines December 19, 11:55pm: At least one Kaggle submission needs to have been made. January 15, 11:55pm: The Kaggle competition closes. January 22, 11:55pm: Upload to Moodle your project report and code, one per group.`'",,Hands-On-AI-UMONS-2019,inClass,Web Traffic Forecasting,smape,hands-on-ai-umons-2019 986,"'`Welcome! In this in-class competition, the task is to predict the HDB resale price based on features like town, flat type, address, floor area and flat model. This is based on the data downloaded from https://data.gov.sg on 9 July 2019 (the data covers resale transactions from 1990 up to 30 Jun 2019). You may frame this task as a regression problem or a time series forecast problem. In order to reduce overhead, we have provided some boilerplate code to get started and submit predictions based on blending of simple regressors.
See the starter-notebook under 'Kernels'. Resources Hosted Jupyter notebook Free GPU Practice skills Feature engineering Explore new libraries like CatBoost Explore bagging, gradient boosting, and other linear, deep learning, tree-based methods Toy with model ensembling/blending Experiment with hyper-parameter search Explore use of entity embeddings Sharing is caring Sharing your kernels is greatly encouraged! Acknowledgements Photo by Amos Lee on Unsplash`'",,HDB Resale Price Prediction,,,rmse,hdb-resale-price-prediction 987,"'`Welcome to the Health Hackers Malaria Challenge! Introduction Malaria is a serious disease that has cost countless lives and is particularly devastating in several countries in Africa. This challenge focuses on the detection of the malaria parasite in blood smear images. As it is particularly prevalent in countries where access to experts and special equipment might be limited, this challenge will focus on building a robust and efficient classifier. The Data The data used for this challenge has been provided by the NIH. It was randomly shuffled into training and test data but not modified in any way. Note that, due to the evaluation of the challenge, there is no point in matching images to available ground-truth labels, because we will want to see your Kaggle kernel at the end. You will need to present your kernel at the final meeting to be eligible for the prize. The Challenge This challenge will be organized partly through real-life meetings. Be sure to join the kick-off event, which you can find on Meetup. At the end of the challenge (on the last day) we will have a final event, where we will compute the final scoring, have a little prize ceremony and exchange ideas on the solutions and the challenge. How to join Everybody can join this challenge. Please contact pablo@healthhackers.de for an invite or attend the kick-off event.
Note that the prize is only awarded to participants of the final event on July 26 at the Health Hackers summer party at Thalermühle, Erlangen. You can register for it on Meetup. Who we are This event is organized by the Health Hackers Erlangen e.V., a non-profit that brings together academics, industry experts, students and medical professionals. To learn more about us, check out our website. If you have any cool ideas for a challenge or workshop, or want to help organize events, reach out to christian@healthhackers.de . We are always open to ideas and welcome everybody!`'",,Health Hackers Malaria Challenge,inClass,Detect Malaria Parasites in Blood Smear Images,meanfscore,health-hackers-malaria-challenge 988,"'`The objective of this competition is to train a Deep Learning model to detect tumour tissue in histological slices. Breast cancer is the most common cancer in women. Accurately identifying the presence of cancer is an important clinical task, where automatic methods could be applied to shorten diagnosis time and reduce diagnostic error. Each sample consists of a 50x50-pixel crop, obtained from larger histological images. Each sample in the training set has an associated binary label: 1 if the central 32x32 region of the crop contains at least one pixel of tumour tissue, and 0 otherwise. The competition will be won by the model with the best classification performance on the test subset. Acknowledgements This dataset was taken from the [Histopathologic Cancer Detection][1] competition.`'",,Histología en Cáncer de Mama,,,categorizationaccuracy,histologa-en-cancer-de-mama 989,'`Insert text here. Please use markdown.`',,hLR79wBWD822ZkE,inClass,[200115] Pingpong AI Research - 최우용님,categorizationaccuracy,hlr79wbwd822zke 990,"'`HMIF Tech Data Science Bootcamp Data Science bootcamp is a 3-week basic data science training that contains discussion and hands-on learning.
Trained by speakers with a variety of experience in data science competitions, participants will gain a range of new knowledge about data science. Metric: RMSE (Root Mean Square Error).`'",,HMIF DS Bootcamp Task 2,inClass,Predict review rating,rmse,hmif-ds-bootcamp-task-2 991,"'`IIT Challenge is a series of internal competitions held by Inkubator IT HMIF. The problem using this dataset is to predict a customer's rating from the text of their review. The data, along with its explanation, can be found here. The metric used is RMSE (Root Mean Square Error); the smaller the RMSE value, the better. Important - The team name must be the same as the team name used at registration; each team may only register one team. - You may only invite members who were actually registered to the team. - This dataset may only be used for this competition. - For the FUN category, add [FUN] in front of the team name, for example: [FUN] NamaTim - Read the rules here If you run into any problems, please get in touch in the community group. Happy learning, and compete well!`'",,Dataset 2 HMIF IIT Challenge,inClass,HMIF IIT Challenge Data Mining Competition,rmse,dataset-2-hmif-iit-challenge 992,"'`Start here if... You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition. Competition Description Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Practice Skills Creative feature engineering Advanced regression techniques like random forest and gradient boosting Acknowledgments The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. `'",,Housing Prices Competition for Kaggle Learn Users,,,mae,housing-prices-competition-for-kaggle-learn-users 993,"'`The final hot dog recognition hackathon! Based on the Silicon Valley TV series: https://www.youtube.com/watch?v=vIci3C4JkL0&t=38s Your task: to build the revolutionary classifier that defeated most investors and geeks from Silicon Valley! More precisely, you need to predict whether there is a hot dog in the image or not. The probability that there is a hot dog in the image must be written to the solution file. The quality of the model will be measured using the AUC-ROC metric; the public leaderboard will be built on 50% of the observations, the private leaderboard on the remaining 50%. Can you repeat the success of Jian Yang? We believe in you!`'",,Hot-dog challenge! SPBU,,,auc,hot-dog-challenge!-spbu 994,"'` train.csv . , . test.csv, sampleSubmission.csv. submit .`'",,Hotel Reviews Classification,inClass,,meanfscore,hotel-reviews-classification 995,"'`Use application data to predict loan default. Try to create new features. Be careful, not all features are helpful.`'",,Car loan default,inClass,,auc,car-loan-default 996,"'` : , 15 . .`'",,HSE Math competition 1,,,auc,hse-math-competition-1 997,"'`Instruction for HW3 Influence Maximization Imagine that each node is a person in a social community; we assume the celebrity is the one who has more friends (a higher degree) than others. If you want to hire the celebrity to promote your products, your budget depends on the number of their friends.
If you have a limited budget (50 credits, where 1 credit = 1 degree of node link) to select the initial influencers, how can you find a strategy to reach the maximum number of activated nodes in a very short period? 1. You can load the node edges dataset from Data > hw3infmax.csv 2. Each node edge comes with a weight between nodes that acts as an edge threshold, randomly generated from a normal distribution. 3. You can customize the initial activated nodes to be influencers from Data > initial_nodes.csv 4. You are limited to choosing initial nodes whose total degree (which acts as cost) does not exceed 50. 5. You have only 2 iterations, with an activation threshold no lower than 0.5, to spread influence to the maximum number of nodes; the baseline from the example code is 651 activated nodes. 6. You can also apply the greedy optimization algorithm (slide page 14) to optimize influence maximization; all parameters must follow the limitation rules. 7. Bonus points will be allocated to students who compose their own code to improve computational time, the activation model, etc., performing better than the example code. 8. The submission result MUST be based on the algorithm's calculation. If you try to manipulate or falsify the submission result, your submission will be rejected. 9. Every submission will be tested again at the end of the competition to verify it against your selected initial nodes, your code and the result. 10. You should upload the report of Homework 3, which includes the following materials, to the MOODLE system in ZIP file format. I. Homework Report (2-4 pages, should be in English) explaining the parameters you used to select the initial nodes, the experiment result in each iteration (you can plot your result in a graph to benchmark against tests with more iterations), and a discussion of your method and result. II. The initial_nodes.csv file of your selection, to verify your submission result. III. Source code, if you have implemented any algorithm to find a better result. 11.
Final score will depends on ranking and quality of the homework report, better result in leader board with poor report quality would have less score. The Deadline is on 15 June 2018. 11:59pm. Kick Start Code import numpy as np import scipy as sp import networkx as nx import pandas as pd import csv import matplotlib.pyplot as plt %matplotlib inline #load node_edge and weight from CSV file node_edge = pd.read_csv(""hw3infmax.csv"") #Identify lenght of edges size print ""Total Edges = "", len(node_edge) Total Edges = 88234 # Draw Network Graph G = nx.Graph() i = 0 while len(node_edge) > i: # Add each edge to Node_edge G.add_edge(node_edge.loc[i,'from_node'], node_edge.loc[i,'to_node'], weight = node_edge.loc[i,'weight']) i = i+1 elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d['weight'] > 0.5] esmall = [0.5 >= (u, v) for (u, v, d) in G.edges(data=True) if d['weight']] # Set Network Graph Size: [100, 100] plt.figure(figsize=(100,100)) A = np.array(nx.adjacency_matrix(G).todense()) # Converted Nodegraph to Adjacency Matrix pos = nx.spring_layout(G) # positions for all nodes nx.draw_networkx_nodes(G, pos, node_size=500, node_color='g') # Set Nodes Size and Color nx.draw_networkx_edges(G, pos, edgelist=elarge, width=2) nx.draw_networkx_edges(G, pos, edgelist=esmall, width=2, alpha=0.5, edge_color='b', style='dashed') # Set Edges Size, Color, Line nx.draw_networkx_labels(G, pos, font_size=20, font_family='sans-serif') # Set Labels Size plt.axis('off') plt.show() total_nodes = len(G) print ""total edges ="",i print ""total nodes ="",total_nodes # List all Nodes Parameters: Node Degree, Degree Centrality, Clustering , etc. 
deg_cen = nx.degree_centrality(G) print ""node, degree, degree_centrality, clustering"" i = 0 while total_nodes > i: print i, G.degree(i), deg_cen[i], nx.clustering(G,i) i=i+1 node, degree, degree_centrality, clustering 0 347 0.0859336305102 0.0419616531459 1 17 0.00421000495295 0.419117647059 2 10 0.00247647350173 0.888888888889 3 17 0.00421000495295 0.632352941176 4 10 0.00247647350173 0.866666666667 5 13 0.00321941555225 0.333333333333 6 6 0.00148588410104 0.933333333333 7 20 0.00495294700347 0.431578947368 8 8 0.00198117880139 0.678571428571 # load node_edge and weight from CSV file initial_nodes = pd.read_csv(""initial_nodes.csv"") submission_list = initial_nodes print ""Total Initial nodes ="",len(initial_nodes) Total Initial nodes = 4039 # List all Activated Nodes from Initial_node.csv print ""List of initial activated nodes and cost (degree)"" print ""node, cost"" i = 0 cost = 0 activated_list = [] while len(initial_nodes) > i: if initial_nodes.loc[i,'activated'] == 1: print initial_nodes.loc[i,'node'], G.degree(i) activated_list.append(initial_nodes.loc[i,'node']) cost = cost+G.degree(i) i=i+1 print activated_list print ""Number of initial nodes = "",len(activated_list) print ""Total Cost = "",cost List of initial activated nodes and cost (degree) node, cost 161 25 1648 25 [161, 1648] Number of initial nodes = 2 Total Cost = 50 # Activated Model iteration_time = 2 # Iteration round of Influence Maximization threshold = 0.5 # Threshold for activated for x in range(iteration_time): i=0 while len(activated_list) > i: neighbors_list = [n for n in G[activated_list[i]]] print ""neighbors = "",len(neighbors_list) j = 0 while len(neighbors_list) > j: weight_attr=nx.get_edge_attributes(G,'weight') print activated_list[i],""Activated node found neighbor"",neighbors_list[j] try: if weight_attr[(activated_list[i],neighbors_list[j])] > threshold: print ""ACTIVATED"",weight_attr[(activated_list[i],neighbors_list[j])] submission_list.loc[neighbors_list[j],'activated'] = 
1 # Activated this node except KeyError: if weight_attr[(neighbors_list[j],activated_list[i])] > threshold: print ""ACTIVATED"",weight_attr[(neighbors_list[j],activated_list[i])] submission_list.loc[neighbors_list[j],'activated'] = 1 # Activated this node j=j+1 i=i+1 print ""========== END Round "",x+1,"" ============"" k = 0 cost = 0 activated_list = [] while len(submission_list) > k: if submission_list.loc[k,'activated'] == 1: print submission_list.loc[k,'node'], G.degree(k) activated_list.append(initial_nodes.loc[k,'node']) cost = cost+G.degree(k) k=k+1 print activated_list print ""Round "",x+1,""Activated Nodes"",len(activated_list) print ""Total Cost = "",cost neighbors = 25 161 Activated node found neighbor 0 161 Activated node found neighbor 258 ACTIVATED 0.55 161 Activated node found neighbor 9 161 Activated node found neighbor 142 ACTIVATED 0.54 161 Activated node found neighbor 271 161 Activated node found neighbor 277 ACTIVATED 0.65 # Write all Activated node result to submission.csv submission_list.to_csv('submission.csv',index=False) # Ready for Submission Acknowledgements Node edges dataset adopt from Stanford Large Network Dataset Collection. http://snap.stanford.edu/data/egonets-Facebook.html Example code is composed by Sutthisak Sukhamsri D10615809@mail.ntust.edu.tw`'",,HW3InfluenceMaximization,inClass,Reach maximum activated nodes with limited resources,meanfscore,hw3influencemaximization 998,"'`Mahasiswa diminta untuk memprediksi harga saham berdasarkan riwayat harga saham tsb. sebelumnya. Untuk melakukannya, mahasiswa dapat memanfaatkan model RNN. Sebetulnya, tim asisten sudah sempat membuat model RNN sederhana. 
The architecture of the model is as follows: an RNN layer with 50 hidden states, and a Dense layer with a linear activation function. But after a week as a trader, it still doesn't feel stonks enough.`'",,IF4074 Praktikum 2 - RNN,inClass,Stonks or Not Stonks,rmse,if4074-praktikum-2-rnn 999,'`Sample Competition`',,IHS Markit Sample Competition,,,mape,ihs-markit-sample-competition 1000,"'`Welcome to the Fashion-MNIST Challenge! Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. Here's an example of how the data looks (each class takes three rows): Website reference: https://github.com/zalandoresearch/fashion-mnist`'",,ML Project - Image Classification,,,categorizationaccuracy,ml-project-image-classification 1001,"'`This challenge consists of predicting which clients will accept a subscription to a bank product in a marketing campaign, with respect to the following variables described in the section below. The submission must contain the probabilities from the machine learning model. 1 - edad (numeric) 2 - Trabajo: type of job (categorical: administrador, operador-industria, tecnico, servicio, ejecutivo, retirado, emprendedor, propio-empleado, ama-de-llaves, unemployed, estudiante, desconocido) 3 - Estado Civil: marital status (categorical: 'divorciado','casado','soltero','desconocido') 4 - Grado Educacion (categorical: 'primaria.4y','primaria.6y','primaria.9y','secundaria','analfabeto','tecnico','universitario','desconocido') 5 - Credito por Default: does the client have a credit in default?
(i.e. already holds a credit) (categorical: 'no','si','desconocido') 6 - Deuda Casa: does the client already have a housing loan? (categorical: 'no','si','desconocido') 7 - Deuda Personal: does the client have a personal loan? (categorical: 'no','si','desconocido') Related to the last contact made with the client in the current campaign: 8 - Contacto: contact communication type (categorical: 'celular','telefono') 9 - Mes: month in which the last contact was made (categorical: 'enero', 'febrero', 'marzo', ..., 'noviembre', 'diciembre') 10 - Dia semana: day on which the last contact was made (categorical: 'lunes','martes','miercoles','jueves','viernes') 11 - Duracion: duration of the last contact in seconds (numeric). IMPORTANT: this attribute is related to the target (e.g., if the duration was 0 then y='No'). Other attributes: 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact) 13 - pdias: number of days that have passed since the client was last contacted (numeric; 999 means the client has not been contacted) 14 - Llamadas previas: number of contacts with this client before this campaign (numeric) 15 - presultado: outcome of the previous campaigns (categorical: 'fracaso','inexistente','exito') Economic and Social Attributes 16 - tasaVarEmp: employment variation rate - quarterly indicator (numeric) 17 - indicador consumidor precio: consumer price indicator - monthly indicator (numeric) 18 - indicador confianza consumidor: consumer confidence indicator - monthly indicator (numeric) 19 - indicador macro: the bank's macroeconomic indicator - daily indicator (numeric) 20 - ind. Cuartil emp: number of employees - quarterly indicator (numeric) TARGET: 'si','no'.
Convert these variables to numeric values: SI: 1 and NO: 0`'",,INF648 - Curso de Aprendizaje Automtico,,,auc,inf648-curso-de-aprendizaje-automtico 1002,"'`The goal of this project is to identify pairs of similar sentences on a scale of 0 to 5, as defined here: https://www.aclweb.org/anthology/S17-2001.pdf . The dataset indicates the semantic similarity between two sentences, where 5 represents two completely equivalent sentences and 1 represents two sentences with no similarity at all. The average of 5 human ratings is used as the value to predict. Dataset: STS-B http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark Evaluation metric: We will use Spearman correlation, a standard measure for this type of task. It measures the correlation between your system's scores and the human scores available in the test set. A starter notebook is available in the notebooks section.`'",,INF8460 STS-B _ A19,inClass,Textual semantic similarity,spearmanr,inf8460-sts-b-_-a19 1003,"'`Energy consumption forecast The goal of this competition is to accurately predict the energy consumed by each unit. To that end, a few measured variables that should be related to the output are provided; these are readings from various sensors.`'",,Energy Consumption,,,mse,energy-consumption 1004,"'`The Sydney Innovation Week 2020 Vaccine Sentiment Coding Challenge is hosted as part of the University of Sydney's Innovation Week. Innovation Week is organised by the Innovation Hub, and the Challenge itself is facilitated by the Sydney Informatics Hub, working with Associate Professor Adam Dunn from the School of Medical Sciences. We are inviting students, professionals, and researchers to participate in the challenge to help understand vaccine-related misinformation. The spread of misinformation can undermine confidence in vaccination.
The capacity for social media to quickly and effectively spread information or misinformation is a pressing question for governments and global agencies. Tools such as deep learning, machine learning, sentiment analysis and text mining could provide solutions that help monitor the emergence of vaccine misinformation, underpin mitigation strategies, and limit damage to vaccine confidence. You are invited to assist Assoc. Prof. Adam Dunn from the Faculty of Medicine and Health in designing a classifier to find vaccine sentiment in Twitter posts. Innovation Week celebrates the ground-breaking discoveries and transformative inventions from our academics and students. It is organised by the University of Sydney Innovation Hub. The Sydney Informatics Hub (SIH) is a Core Research Facility of the University of Sydney. SIH provides support, training, and expertise on research data, analyses and computing for research staff, students and affiliates of the University. We deliver policies, systems, advice, engineering and training to research staff, students and affiliates of the University. The University of Sydney Discipline of Biomedical Informatics and Digital Health in the Faculty of Medicine and Health: home to an interdisciplinary team of educators, researchers, and clinicians, we do impactful research at the intersection of information and health and equip new generations of health professionals with digital health skills. https://www.sydney.edu.au/medicine-health/schools/school-of-medical-sciences/discipline-of-biomedical-informatics-and-digital-health.html`'",,Sydney Innovation Challenge 2020,inClass,Classify vaccine sentiment in Twitter posts,f_{beta},sydney-innovation-challenge-2020 1005,"'`The Innovation Cup Data Challenge team welcomes you to this first hackathon built around regatta data!
Using only the data collected by the sensors of our partner MetaSail, you have 24 hours to reconstruct the full ranking of each race from the boats' course information. In the training data, you will find the race time achieved by each boat. Your task is to predict this value for the boats in the evaluation set, so that all the boats in each race can then be ordered and the final ranking recovered. So far, the data provided by MetaSail comes from two regattas made up of 6 and 3 races respectively, and gives you information on the boats' GPS trajectories and the wind. The data from the Innovation Cup sporting challenge will reach us as the day's races take place and will complement the dataset presented in the Data tab. Evaluation of the Innovation Cup Data Challenge: You can submit your race-time predictions on this platform up to 20 times per day to evaluate yourself. Nevertheless, you will be ranked on your predictions of both the rank AND the race time using our metric, detailed in the Evaluation tab. You can also send us your predictions by email up to 8 times per day, in the format sample_sub_coachs__teamname#123, and then view your ranking on our Public Leaderboard. How does a race unfold? 1 - Crossing the start line 2 - Upwind leg 3 - Rounding the windward buoy 4 - Downwind leg 5 - Rounding the start buoy 6 - Start of the second lap 7 - Upwind leg 8 - Rounding the windward buoy 9 - Downwind leg 10 - Crossing the finish line Fair winds to all!`'",,Innovation Cup Hackathon,inClass,Train your models on regatta data,rmse,innovation-cup-hackathon 1006,"'`Can you predict the quality of Iron Concentrate after the flotation process? This will be your challenge in this InClass competition designed for the MBA Class of UM6P Morocco.
Flotation is one of the most common processes used to concentrate ores (phosphate ores, copper ores, iron ores). In this challenge, you will be provided with the production data of an iron flotation factory, and your main goal is to predict how much impurity is in the iron ore concentrate at the end of the process. Since this impurity (% Silica) is measured manually in the lab every hour, predicting how much impurity is in the ore concentrate without waiting for the hourly lab results gives the engineers early information to act on. Hence, they will be able to take corrective actions in advance (reducing impurity, if needed) and also help the environment (reducing the amount of ore that goes to tailings as you reduce silica in the ore concentrate).`'",,Iron Flotation: Quality prediction,inClass,Predict the quality of the iron flotation process using real world data,rmse,iron-flotation:-quality-prediction 1007,'`Your goal is to predict future energy consumption in different cities.`',,istc course timeseries competition,inClass,competition on time series forecasting for students,mape,istc-course-timeseries-competition 1008,"'`Greetings!! This competition aims to test all the knowledge you have acquired over the past couple of weeks to successfully build a model on a completely new, never-before-seen dataset. NOTE: USE OF NEURAL NETWORKS IS NOT ALLOWED`'",,ISTE ML SMP 2020,,,auc,iste-ml-smp-2020 1009,"'` , {IT.IS} Upgrade 2.0, , .`'",,{IT.IS} Upgrade 2.0,,,f_{beta},{it.is}-upgrade-2.0 1010,"'`Hello and welcome to the ITU Data Science Bowl of 2019! This is the first edition of an initiative to aid data science students at ITU with (applied) machine learning and general data analysis.
The competition lasts over the summer, since this seems to be the only time during the year where we can catch a break - it also (hopefully) gives everyone enough time to dive into the problem and learn some programming, machine learning, or something entirely different (plotting with Python, perhaps)! Problem This first time, we're starting off with the infamous binary classification problem: given an observation, you have to predict which of two classes it belongs to. The training set contains 500 observations (with corresponding correct classes) and the test set contains 2500 observations (with unknown classes). You can read more about the data files under the ""Data"" tab above. Kaggle first-timer? If this is your first time on Kaggle, allow me to introduce you to the general gist of it. Firstly, Kaggle is an online capital for data scientists - not just machine learning, but anything regarding data science is discussed here. Secondly, it's a great website for learning about machine learning (whether it be applied or theoretical) and analysis of real-world data. Right now, you're on an in-class competition page. This is a private competition, but there are many public competitions on Kaggle as well. In the menu above, there are some tabs (""Data"", ""Kernels"", ""Leaderboard"", etc.) that will provide you with some information about this competition. The ""Kernels"" tab is where you can create your own (or see others') online kernel notebook or script, where you can use either Python or R. The ""Data"" tab will tell you about the data for this competition. The ""Leaderboard"" tab will tell you the current standings. And so on. There is a Python script with a baseline model (Logistic Regression with regularization) and a Python notebook (exploratory data analysis) publicly available to you under the ""Kernels"" tab to help you start Kaggling! How does a Kaggle competition work?
There are a few different kinds, but this one works like so: A problem has been stated, which is some kind of machine learning task (in this case it's classification), where you've been given some training data and test data. You have to develop a machine learning model using the training data, predict on the test data, and then upload your predictions onto this competition site (under ""My Submissions""). Your predictions will be evaluated against the true values and a measure of your performance will be returned. The returned measure is calculated against 40% of the test data (called the public test data), and your score on the rest of the test data (the private test data) is kept secret until the end of the competition. You can see everyone's best public test score under the ""Leaderboard"" tab. However, the competition will be decided by the private test score - so you don't know if you've won until the very end. This is to ensure that people who overfit to the public test data (thereby making a model that doesn't generalize well) do not win. Therefore, you have to make your model generalize as well as possible to win! In this competition, you can submit predictions of the test set 3 times a day, i.e., you can improve upon your model every day and validate it with the provided training set, but you can only see how your model performs on the public test set 3 times each day.`'",,ITU Data Science Bowl 2019,inClass,Data science students of ITU improving their machine learning skills together!,logloss,itu-data-science-bowl-2019 1011,"'`In this challenge you will encounter a payment transaction fraud case. Fraud detection is one of the most popular machine learning applications in fintech. Considering how hard fraud data is to find, we hope you enjoy this case. Founded in 2013, iyzico is the brainchild of German-born Turks, Barbaros zbuutu and Tahsin Isn, who moved from Germany to Turkey to set up the business.
Having started out with 3 people, iyzico is now Turkey's fastest-growing fintech company, composed of more than 130 people providing secure payment solutions to online sellers of different sizes as well as online shoppers.`'",,IYZICO Projesi,inClass,Data & Analytics Challenge - 1st Online Project 15-21 January,f_{beta},iyzico-projesi 1012,"'`JDS homework #6: Fashion MNIST (https://github.com/zalandoresearch/fashion-mnist), 70,000 grayscale images in 10 classes (28 x 28 pixels each), split into 60,000 training and 10,000 test examples. Try, for example, the following classifiers: -- KNeighborsClassifier -- LogisticRegression -- Naive Bayes -- Bayesian Classifier -- Linear SVC -- DecisionTreeClassifier. PCA may optionally be applied. Submit your work as a jupyter notebook.`'",,JDS.Howework.Competition,inClass,Junior Data Scientist Homework,categorizationaccuracy,jds.howework.competition 1013,"'`Dear students, this Kaggle competition is just for you! Learn here how to make a prediction with your classifier (e.g. a k-nearest-neighbour classifier). Under the ""Data"" tab you will find the training data in train.csv, in order to predict the target label y for the data in test.csv. Regarding the submission format, there is a file SampleSubmission.csv which illustrates the required format. In short, a .csv file of the following form is required: Id,y 0,1 1,0 2,0 3,0 4,1 ... 8099,1 To fetch the data, you can use the Kaggle API. To do so, install the CLI, fetch your login credentials from your profile, and save them under ~/.kaggle/kaggle.json. Then: kaggle competitions download -c male-daan-schnell-mal-klassifizieren Upload your .csv file as a Kaggle submission! and kaggle competitions submit -c male-daan-schnell-mal-klassifizieren -f clever_submission.csv -m ""my second attempt today. Expected accuracy 95.3%"" Of course, all of this can also be done with mouse and click-click. Attention! You are only allowed to upload two submissions per day. Anything beyond that will not be counted.
So think carefully about which prediction you upload.`'",tabular data,Schnell-Mal-Klassifizieren,inClass,"The k-NN's dream, the perceptron's 2D nightmare",categorizationaccuracy,schnell-mal-klassifizieren 1014,"'`On the Internet, freedom of speech through anonymity has been considered advantageous in that many people can express their opinions transparently. At the same time, however, there are also negative impacts, such as severely aggressive or insulting threats directed at a specific individual or group. Recently, in Korean society, a series of issues presumed to be caused by target-specific toxic comments has been widely publicized. More specifically, after some celebrities repeatedly and openly spoke about their mental suffering, some cases ended in tragedy. Since then, hate speech in online spaces has emerged as a critical social issue; yet a methodology that carefully filters problematic text has not been promptly deployed by online service providers, and the comment system of entertainment news is currently closed on the main news platforms. Without a doubt, various attempts have been proposed to handle this problem, and term-existence-based detection is a representative method. A dictionary of publicly available profanity terms for the Korean language has also been distributed, and some systems are known to be trained and deployed based on these resources. Although identifying hate speech by lexical matching is straightforward, there are limitations. Such terms may be meta-language that merely mentions specific content, and profanity terms can appear in modified forms. Even without such terms, bias and sexual offensiveness implied in a sentence may be aggressive toward target figures or other readers. In this regard, we present the first Korean corpus annotated with gender bias. We collected comments from a Korean entertainment news platform that serves a wide range of users in Korea.
We believe that our initial efforts in constructing a Korean gender bias dataset that also covers insults, along with the detection system, will be supportive of both social good and sociolinguistic studies.`'",text data,Korean Gender Bias Detection,inClass,Identify gender bias in Korean entertainment news comments,f_{beta},korean-gender-bias-detection 1015,"'` . 14 : - FAQ - - - SIM- FAQ - , ( ), , . . , (, ) ( ). ODS DataSouls , https://datasouls.com/c/nghack2019`'",text data,OCRV Test Task,inClass,Demonstrate your text classification skills,meanfscore,ocrv-test-task 1016,"'`The MNIST dataset is a classic of machine learning, in which the goal is to correctly classify 28 x 28 grayscale images of handwritten digits 0 - 9. In this competition, you're going to be working on a similar but more challenging task: correctly classifying 28 x 28 grayscale glyphs of letters A - J, a subset of the so-called notMNIST dataset. This is a great way to experiment with what you've learned about convolutional neural networks. Acknowledgements Thanks to Yaroslav Bulatov for creating this dataset.`'",image data,notMNIST Competition,inClass,Correctly classify images of letters A - J in different fonts,categorizationaccuracy,notmnist-competition 1017,"'`Welcome to the ML Hackathon 2019, conducted by the Developer Students Club at ASE, Coimbatore! This dataset contains information collected from various patients' records, based on various medical observations and diagnostic tests. Your task is to determine whether a patient suffers from gastric ulcers or not, based on the given information. Disclaimer: This dataset has been largely fabricated and must NOT be used anywhere else for referral/research/citation purposes or otherwise`'",tabular data,ML Hackathon 2019 Q1,inClass,Can you predict the presence of a gastric ulcer in a patient?,categorizationaccuracy,ml-hackathon-2019-q1 1018,"'`Welcome to the ML Hackathon 2019, conducted by the Developer Students Club at ASE, Coimbatore!
This dataset contains images of three fruits - apples, oranges and pomegranates. Your task is to classify the test data images into one of these three categories. While classifying the test images, please follow our notation: 0 - pomegranates 1 - oranges 2 - apples Acknowledgements: [1] Horea Muresan, Mihai Oltean, Fruit recognition from images using deep learning, Acta Univ. Sapientiae, Informatica Vol. 10, Issue 1, pp. 26-42, 2018. [2] Sriram Reddy Kalluri: 'Fruits fresh and rotten for classification: Apples Oranges Bananas' dataset [3] Google images`'",image data,ML Hackathon 2019 Q2,inClass,Train a neural network to classify 3 kinds of fruit!,categorizationaccuracy,ml-hackathon-2019-q2 1019,"'` IEEE 2008 . , . 500 . .`'",time series,Time Series Classification,inClass,Time Series Toy Competition,auc,time-series-classification 1020,"'`EVRY Norge ( Infopulse Ukraine - ) 3D Digital Twin , ' ( , , , ..) (Digital Twin) IoT , . ' multiple regression for time series data ( ) Valhall. kernels end-to-end baseline jupyter notebook , .`'",time series,Multiple regression for time series data,inClass,Oil production forecast in the Norwegian oil region Valhall,mse,multiple-regression-for-time-series-data 1021,"'`About Us One Fourth Labs is an IIT Madras incubated startup with a goal to make India ready for the AI age. We want to upskill the Indian workforce in the areas of Artificial Intelligence (AI) at almost one-tenth the industry price. Our flagship online school PadhAI provides India-specific courses on AI, and is open to all students, faculty, and professionals with a basic background in mathematics and python. Task The goal is to identify the vowel and the consonant of each character image using Convolutional Neural Networks.
Evaluation Metric Submissions are evaluated on the accuracy score between the predicted and the actual labels on the test dataset`'",image data,PadhAI: Hindi Vowel - Consonant Classification,inClass,Can you predict the vowel and consonant of a Hindi character image?,categorizationaccuracy,padhai:-hindi-vowel-consonant-classification 1022,"'`About Us One Fourth Labs is an IIT Madras incubated startup with a goal to make India ready for the AI age. We want to upskill the Indian workforce in the areas of Artificial Intelligence (AI) at almost one-tenth the industry price. Our flagship online school PadhAI provides India-specific courses on AI, and is open to all students, faculty, and professionals with a basic background in mathematics and python. Task The goal is to identify the vowel and the consonant of each character image using Convolutional Neural Networks. Evaluation Metric Submissions are evaluated on the accuracy score between the predicted and the actual labels on the test dataset`'",,PadhAI: Tamil Vowel - Consonant Classification,inClass,Can you predict the vowel and consonant of a Tamil character image?,categorizationaccuracy,padhai:-tamil-vowel-consonant-classification 1023,'`Predict customer behavior in order to retain customers.`',tabular data,Predio de Churn,,,auc,predio-de-churn 1024,"'`Description Select at least two algorithms for pitch estimation (preprocessing, frequency estimation and/or postprocessing) and voicing detection. Implement the selected methods as a kaggle python kernel or in any language and compare their performance on the test database. You can use the provided baseline kernel as a starting point or reference. The baseline kernel uses a simple algorithm based on the autocorrelation to compute the pitch, without any preprocessing or post-processing. Try to improve the results with the use of standard pre- and post-processing methods, new algorithms, a combination of systems, parameter tuning or machine learning algorithms.
Optionally, you can use the FDA-UE and PTDB-TUB databases for tuning and training. Report the results of the assignment using a 4-page paper format. You can use, for instance, the templates in http://www.icassp2016.com/papers/PaperKit.html#Templates. In the report you have to briefly describe the selected algorithms and initial source code, including the corresponding references. Then you have to describe your experiments or original contributions and the obtained results (detailed results on the FDA-UE and the public score on the Leaderboard). Upload the complete source code as a kaggle kernel or to a git repository (such as github) and provide a link to it in the report.`'",audio,Pitch estimation and voicing detection,inClass,Analysis of basic properties of the speech signal: voicing and pitch,rmse,pitch-estimation-and-voicing-detection 1025,"'`About Us One Fourth Labs is an IIT Madras incubated startup with a goal to make India ready for the AI age. We want to upskill the Indian workforce in the areas of Artificial Intelligence (AI) at almost one-tenth the industry price. Our flagship online school PadhAI provides India-specific courses on AI, and is open to all students, faculty, and professionals with a basic background in mathematics and python. Task The goal is to identify the presence of a character in images using MP Neuron / Perceptron / Perceptron with sigmoid. The character images are compiled in Tamil, Hindi and English. We have arranged the task in 4 levels of increasing data complexity.
Evaluation Metric Submissions are evaluated on the accuracy score between the predicted and the actual labels on the test dataset Acknowledgements Background Data: http://www.image-net.org/ Tamil Character Data: http://www.jfn.ac.lk/index.php/data-sets-printed-tamil-characters-printed-documents/ Hindi Character Data: https://www.kaggle.com/ashokpant/devanagari-character-dataset`'",image data,PadhAI: Text - Non Text Classification Level 4b,inClass,Can you predict whether an image has TEXT or NOT?,categorizationaccuracy,padhai:-text-non-text-classification-level-4b 1026,"'`About Us One Fourth Labs is an IIT Madras incubated startup with a goal to make India ready for the AI age. We want to upskill the Indian workforce in the areas of Artificial Intelligence (AI) at almost one-tenth the industry price. Our flagship online school PadhAI provides India-specific courses on AI, and is open to all students, faculty, and professionals with a basic background in mathematics and python. Task The goal is to identify the presence of a character in images using MP Neuron / Perceptron / Perceptron with sigmoid. The character images are compiled in Tamil, Hindi and English. We have arranged the task in 4 levels of increasing data complexity. Evaluation Metric Submissions are evaluated on the accuracy score between the predicted and the actual labels on the test dataset Acknowledgements Background Data: http://www.image-net.org/ Tamil Character Data: http://www.jfn.ac.lk/index.php/data-sets-printed-tamil-characters-printed-documents/ Hindi Character Data: https://www.kaggle.com/ashokpant/devanagari-character-dataset`'",image data,PadhAI: Text - Non Text Classification Level 4a,inClass,Can you predict whether an image has TEXT or NOT?,categorizationaccuracy,padhai:-text-non-text-classification-level-4a 1027,"'`Welcome to our first graded competition.
Check out the starter Kernel to get going.`'",image data,Oxford Fast AI Week 2,inClass,Graded contest: Classifying flowers,categorizationaccuracy,oxford-fast-ai-week-2 1028,"'`Every day, hundreds of thousands of new products are added to Shopee. To make relevant products easily discoverable, one fundamental challenge is to accurately extract relevant information from a large volume of products. For NDSC 2019, we present this real-world challenge of building an automatic solution to extract product-related information through machine learning techniques. The image data is now available. To download the image data, see the Data page. The theme of NDSC 2019 is Product Information Extraction in the Wild - a challenge to build an automatic solution that extracts product-related information from large volumes of image and free-text data. There will be two main competitions: beginner product category classification and advanced product information extraction. Participants may enter either (or both) of these two competitions, and can choose to tackle any (or all) of the data sources provided on the Data pages: Beginner Category and Advanced Category. On this Kaggle page, we introduce the advanced product information extraction task with advanced category information. Participants are required to detect the brand, model, and other attributes. For those who are also interested in the junior-level task using the Beginner Category, please refer to the Product Category Classification Kaggle page for more details.`'",image data,National Data Science Challenge 2019 - Advanced,inClass,Product Information Extraction,map@{k},national-data-science-challenge-2019-advanced 1029,"'`As part of the Kainos AI Camp, we are running our own (small) Kaggle competition. Unlike regular Kaggle competitions, only other Campers are taking part. The aim is to achieve the highest accuracy. How to form a team You can have a team of up to three people. First you will have to accept the competition rules.
To form a team, click on My Team from the dashboard. There you can search for and add other Kagglers to your team. Please note: until your teammates have accepted the competition rules, they will not show up here. Be sure to designate a Team Leader. If you and someone you want to work with have both formed teams, it is possible to merge teams (as long as the total number of people does not exceed 3). The aim of the competition You will be given a dataset containing images, each of which belongs to one of eight different classes. These are: airplane, car, cat, dog, flower, fruit, motorbike and person. Your task is to train a classifier that will label each image correctly. The scoring will happen on a part of the dataset that you are not given to train on, so make sure your validation accuracy is high! If you overfit on the training data, your submission will have a low score! The final score is the accuracy (technically the cross-categorical accuracy) that you achieve on the test dataset. How to submit You can make multiple submissions - although you are limited to 10 submissions per day. The top two submissions (as in, the two best submissions) from each team will be scored at the end. Kaggle will automatically select the top two, although you can overrule this if you choose to. Advice Convolutional Neural Nets are much better at classifying images than other kinds of Neural Nets. However, you can also experiment with the other kinds of classifiers we have learned about. Good solutions will display some understanding of the underlying theory, so if you get stuck you may want to read some more about how CNNs work. If you're very serious about winning, you have the entire weekend to work on your solution. Prizes Every member of the winning team will get a mystery prize, which will be awarded at the end of the Hackathon on Saturday.
We'll announce the winners on the day of the submission deadline, though.`'",image data,Cats vs Dogs vs More,inClass,There are eight classes total - the highest accuracy wins!,categorizationaccuracy,cats-vs-dogs-vs-more 1030,"'`The car evaluation database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1 (1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure: Number of attributes: 6 Attribute values: buying (buying price) v-high, high, med, low maint (maintenance price) v-high, high, med, low doors (number of doors) 2, 3, 4, 5-more persons (person capacity) 2, 4, more lug_boot (size of the luggage boot) small, med, big safety (estimated safety) low, med, high CLASSES vgood = 1 good = 2 acc = 3 unacc = 4`'",tabular data,Avaliao de Carros,,,logloss,avaliao-de-carros 1031,"'`Welcome to the community's first competition. Will you be able to predict the outcome of future Pokémon battles? To do so, you will have at your disposal the characteristics of all Pokémon and the results of some battles. Three files are available. The first contains the Pokémon characteristics (the first column is the Pokémon id). The second contains information about previous battles. Its first two columns are the identifiers of the Pokémon and the third is the identifier of the winner. Important: the Pokémon in the first column attacks first. The goal is to develop a machine learning model capable of predicting the outcome of future Pokémon battles. The best model wins!`'",tabular data,MLH - Pokemon Challenge,inClass,Will you be able to predict the outcome of Pokémon battles?,categorizationaccuracy,mlh-pokemon-challenge 1032,'`This is the home page of our first competition.
Make your first ML models to predict stroke.`'",tabular data,ML in biology,inClass,Learn how to use ML methods in biology,auc,ml-in-biology 1033,"'`Welcome to Week 6 of the Machine Learning Challenge. This week you get the chance to test the skills that you acquired over the past 5 weeks. This week will be in the form of a competition. You will have to solve a problem based on a real-world dataset and compete with your colleagues for a place on the leaderboard.`'",tabular data,ML Challenge,inClass,Week 6 - Final Project,auc,ml-challenge 1034,"'`Age group prediction challenge Your task is to predict the age group (1 = ""18-29"", 2 = ""30-49"", 3 = ""50-"") of simulated webshop customers. The webshop uses this prediction to improve its recommendations in case a customer does not provide his/her age. Features are the weekday of the transaction, the customer's region and the aggregated basket contents of three product groups: ""wine"", ""beer"", and ""spirits"". Use the training dataset to train your machine learning model. Use the test set to create your predictions. You need to export your predictions in the form shown in sample_submission.`'",tabular data,Predicting Age Groups,inClass,"In-class competition for ""pk-hska""",categorizationaccuracy,predicting-age-groups 1035,"'`Task We have provided a dataset about students and their grades in mathematics at secondary school in Portugal. The dataset is rather small and consists of roughly 400 data points, each with 31 features and 3 target variables. We have split the dataset into traindata.csv and testdata.csv, with 36 data points making up the test data. The first goal is to analyse the data and identify any problems, which we can fix over the course of the day. We will discuss this intermediate step in the plenary session.
Next, a simple linear model should be built on the training dataset which determines the target variable G3 from the other target variables G1 and G2. This is meant to ease your start with scikit-learn, since for now we only consider two numeric features. We will discuss this intermediate step in the plenary session. The linear model just built is very good, but it also does not have a difficult task, since G1 and G2 correlate strongly with G3. So the next step is to build a ""real"" linear model that goes from the actual features to the target variable G3. Pay attention to feature selection and the encoding of features (one-hot encoding). First check your model with a validation set and then submit a prediction on Kaggle. We will discuss this intermediate step in the plenary session. Were you able to submit a prediction? Bravo! Now you have free rein. For example, you can apply data cleaning, try other models (and optimize hyperparameters), do feature engineering, enrich the data further - be creative.`'",tabular data,Machine Learning Lab - CAS Data Science HS 20,inClass,Mask Edition,mae,machine-learning-lab-cas-data-science-hs-20 1036,"'`Description At Shopee, sellers list thousands of products for sale on our platform. A better understanding of users' tastes and preferences for products can help Shopee design better promotions and recommendations for our users. To do that, we conduct market basket analysis, which allows us to identify the relationship between different combinations of products that users buy. We are interested in finding association rules between combinations of different products. These association rules can help to uncover regularities in the purchasing behaviors of our users.
For example, an association rule between 3 products, {Product A & Product B} -> {Product C}, would indicate that a user buying both Product A & Product B would likely buy Product C as well. Confidence is a measure that is used to indicate such tendencies and can be used to determine the association for varying numbers of products. For the purpose of this question, we will be using confidence to calculate the association for 2 products and 3 products. Confidence for two products: Confidence(A -> B) = (number of orders containing both A and B) / (number of orders containing A). Confidence for three products: Confidence(A & B -> C) = (number of orders containing A, B and C) / (number of orders containing both A and B). Or: Confidence(A -> B & C) = (number of orders containing A, B and C) / (number of orders containing A). Basic Concepts Confidence is defined as the tendency that, given that product A is purchased, product B will also be purchased. Each orderid represents a distinct transaction that has occurred. Each itemid represents a unique product that is sold on Shopee. A transaction can contain 1 or more itemid(s). If 2 or more itemid(s) share the same orderid, they are purchased together in a single transaction. An itemid can appear many times in different orderid(s), which means that the product was purchased many times in different transactions. Task Please calculate the confidence values for all the association rules provided in the rules.csv file.
Tips: A > B and B > A have different confidence values and should be calculated separately. A & B > C and B & A > C are identical association rules and will yield the same confidence. Example Case 1: A > B. 8 orderid have itemid 7917849 (31338643584868, 31364354557783, 31368958440199, 31369772179043, 31371954695064, 31375314731607, 31377601474289, 31379328498817). 6 orderid out of the above have both itemid 7917849 and itemid 18642183 (31338643584868, 31368958440199, 31369772179043, 31371954695064, 31375314731607, 31377601474289). Confidence (7917849 > 18642183) = 6 / 8 = 0.75. Please submit 750 (multiply by 1000 and round down to an integer). Case 2: A & B > C. 7 orderid have itemid 2363580843 and itemid 2002243261 (31342449702678, 31365563352719, 31366764361012, 31371701813987, 31372163437582, 31373610230585, 31381568386099). 6 orderid out of the above have all of itemid 2363580843, itemid 2002243261 and itemid 1993068031 (31342449702678, 31365563352719, 31366764361012, 31372163437582, 31373610230585, 31381568386099). Confidence (2363580843 & 2002243261 > 1993068031) = 6 / 7 = 0.857143. Please submit 857 (multiply by 1000 and round down to an integer). Case 3: A > B & C. 9 orderid have itemid 1089203645 (31351735245918, 31367488312991, 31372554805324, 31373458010259, 31373724807962, 31374927925523, 31375318612401, 31375354382289, 31384570619582). 7 orderid out of the above have all of itemid 1089203645, 431391770 and 1216842899 (31351735245918, 31372554805324, 31373458010259, 31373724807962, 31374927925523, 31375318612401, 31375354382289). Confidence (1089203645 > 431391770 & 1216842899) = 7 / 9 = 0.777777. Please submit 777 (multiply by 1000 and round down to an integer).`'",tabular data,Market Basket - ID NDSC 2020,inClass,Challenge #2 for Beginner Category,categorizationaccuracy,market-basket-id-ndsc-2020 1037,"'`This is a dataset on default payments of credit card clients in Taiwan from April 2005 to September 2005. We aim to use this competition to try out some basic skills in data science and machine learning.
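The confidence computation and the submit-value rounding described in the worked cases above can be sketched in a few lines of Python. This is a minimal illustration on toy data with hypothetical orderids/itemids, not the actual dataset:

```python
def confidence(orders, antecedent, consequent):
    """Confidence(A -> B) = |orders containing all of A and B| / |orders containing A|.

    orders: dict mapping orderid -> set of itemids.
    antecedent, consequent: sets of itemids (either side may hold one or more items).
    """
    have_a = [items for items in orders.values() if antecedent <= items]
    if not have_a:
        return 0.0
    have_both = sum(1 for items in have_a if consequent <= items)
    return have_both / len(have_a)


def submission_value(conf):
    # Multiply by 1000 and round down to an integer, as the examples require.
    return int(conf * 1000)


# Toy data (hypothetical ids, not from the real dataset):
orders = {
    1: {"A", "B"}, 2: {"A", "B"}, 3: {"A"}, 4: {"A", "B"}, 5: {"C"},
}
print(submission_value(confidence(orders, {"A"}, {"B"})))  # 3/4 -> 750
```

Because both sides are sets, the same function handles the A > B, A & B > C, and A > B & C cases from the examples.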
It is divided into five main stages, which will span the three weeks of onboarding: (1) trial of basic machine learning models; (2) improving model performance through exploratory data analysis; (3) addressing the imbalanced data problem and data augmentation; (4) dimension reduction and feature selection; (5) ensemble models (bagging, boosting, stacking).`'",tabular data,NCTU BDALAB 2020 Onboard,inClass,2020 Onboarding for NCTU BDALAB,auc,nctu-bdalab-2020-onboard 1038,"'`This is a competition for participants of the Open Machine Learning Course by ODS in Dubai. The goal is to predict the musical genres of a track given some features extracted from the wave file and some metadata fields. One track can have several genres. For this task you can extract any features you want, and you can use any ML models supported by Kaggle Kernels. You should write your solutions only in Kaggle Kernels. You can keep your kernel private until the end of the competition, but if you want you can make any notebook public before the end. Cheating = ban from ODS. If, after the end of the course, you cannot show (publish) a notebook that replicates your score, you will be excluded from the leaderboard. If you want to get additional credits on the course, you may want to write an article using Kaggle Kernels about the F1 score function. Page of the course`'",tabular data,"MLClass Dubai by ODS, Lecture 6 HW",inClass,"This is a homework competition for participants of MLClass Dubai by ODS, Lecture 5 HW",meanfscore,"mlclass-dubai-by-ods,-lecture-6-hw" 1039,"'`Logistics Performance Due to the recent COVID-19 pandemic across the globe, many individuals are increasingly turning to online platforms like Shopee to purchase their daily necessities. This surge in online orders has placed a strain on Shopee and our logistics providers, but customer expectations on the timely delivery of their goods remain high.
On-time delivery is arguably one of the most important factors of success in the eCommerce industry, and now more than ever we need to ensure that orders reach our buyers on time in order to build our users' confidence in us. In order to handle the millions of parcels that need to be delivered every day, we have engaged multiple logistics providers across the region. Only the best logistics providers, those able to meet Shopee's delivery standards, are partnered with us. The performance of these providers is monitored regularly and each provider is held accountable based on the Service Level Agreements (SLA). Late deliveries are flagged and penalties are imposed on the providers to ensure they perform their utmost. This consistent monitoring and process of holding our logistics providers accountable allows us to maintain our promise of timely deliveries to our buyers. Task Identify all the orders that are considered late according to the Service Level Agreements (SLA) with our Logistics Provider. For the purpose of this question, assume that all deliveries are successful by the second attempt. Basic Concepts Each orderid represents a distinct transaction on Shopee. SLA can vary across each route (a route is defined as Seller's Location to Buyer's Location) - refer to SLA_matrix.xlsx. Pick Up Time is defined as the time when the 3PL picks up the parcel and begins to process it for delivery. It marks the start of the SLA calculation. A Delivery Attempt is defined as an attempt made by the 3PL to deliver the parcel to the customer. It may or may not be delivered successfully. In the case where it is unsuccessful, a 2nd attempt will be made. A parcel that has no 2nd attempt is deemed to have been successfully delivered on the 1st attempt. All time formats are stored in epoch time based on Local Time (GMT+8). Only consider the date when determining if the order is late; ignore the time. Working Days are defined as Mon - Sat, excluding Public Holidays.
SLA calculation begins from the next day after pickup (Day 0 = Day of Pickup; Day 1 = Next Day after Pickup). The 2nd Attempt must be no later than 3 working days after the 1st Attempt, regardless of the origin-to-destination route (Day 0 = Day of 1st Attempt; Day 1 = Next Day after 1st Attempt). Only consider the date when determining if the order is late; ignore the time. Assume the following Public Holidays: 2020-03-08 (Sunday); 2020-03-25 (Wednesday); 2020-03-30 (Monday); 2020-03-31 (Tuesday). Submission Format Check each delivery order and determine whether it is late. Two columns are required: orderid, and is_late (assign the value 1 if the order is late, otherwise 0). For example: orderid 1955512445, is_late 0; orderid 1955598428, is_late 1. Your submission should have 3,176,313 rows (excluding headers), each with 2 columns. Tips: 1) You are advised to run your tests on a sample of the dataset first. 2) If you are unable to solve the entire problem within the time limit, create the output csv with the required number of columns and rows based on a subset of the problem first.`'",tabular data,[Open] Shopee Code League - Logistics,inClass,,matthewscorrelationcoefficient,[open]-shopee-code-league-logistics 1040,'`A competition created to sharpen the skills of the members of KD I-RICH.`',tabular data,I-RICH ML COMPETITION,inClass,,mae,i-rich-ml-competition 1041,"'`This competition is part of the ""Deep Learning with PyTorch: Zero to GANs"" live online course. In this competition, you will develop models capable of classifying mixed patterns of proteins in microscope images. Images visualizing proteins in cells are commonly used for biomedical research, and these cells could hold the key for the next breakthrough in medicine. However, thanks to advances in high-throughput microscopy, these images are generated at a far greater pace than what can be manually evaluated. Therefore, the need is greater than ever for automating biomedical image analysis to accelerate the understanding of human cells and disease.
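The working-day arithmetic behind the SLA rules above (Mon-Sat, skipping the listed public holidays, Day 0 = pickup day) can be sketched as follows. The function names and the worked example are illustrative assumptions, not part of the official checker:

```python
from datetime import date, timedelta

PUBLIC_HOLIDAYS = {date(2020, 3, 8), date(2020, 3, 25),
                   date(2020, 3, 30), date(2020, 3, 31)}


def is_working_day(d):
    # Working days are Mon-Sat, excluding the listed public holidays (6 = Sunday).
    return d.weekday() != 6 and d not in PUBLIC_HOLIDAYS


def add_working_days(start, n):
    """Day 0 = start; count n working days forward, skipping Sundays/holidays."""
    d, remaining = start, n
    while remaining > 0:
        d += timedelta(days=1)
        if is_working_day(d):
            remaining -= 1
    return d


def is_late(pickup, first_attempt, sla_days, second_attempt=None):
    # Late if the 1st attempt misses the route SLA, or the 2nd attempt
    # comes more than 3 working days after the 1st. Dates only, no times.
    if first_attempt > add_working_days(pickup, sla_days):
        return 1
    if second_attempt and second_attempt > add_working_days(first_attempt, 3):
        return 1
    return 0


# Hypothetical example: 3-working-day SLA, pickup Fri 2020-03-06.
# Day 1 = Sat 03-07, Sun 03-08 is a holiday, Day 2 = Mon 03-09, Day 3 = Tue 03-10.
print(is_late(date(2020, 3, 6), date(2020, 3, 11), 3))  # 1 (late)
```

Applying this per row and writing orderid/is_late pairs would produce the required two-column submission.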
This is a multilabel image classification problem, where each image can belong to several classes. The class labels are as follows: 0: 'Mitochondria', 1: 'Nuclear bodies', 2: 'Nucleoli', 3: 'Golgi apparatus', 4: 'Nucleoplasm', 5: 'Nucleoli fibrillar center', 6: 'Cytosol', 7: 'Plasma membrane', 8: 'Centrosome', 9: 'Nuclear speckles' Acknowledgements The data for the competition has been taken from the Human Protein Atlas Image Classification competition on Kaggle`'",,Zero to GANs - Human Protein Classification,,,macrofscore,zero-to-gans-human-protein-classification 1042,"'`Introduction This is the home page of the KaggleDays Paris competition. Purpose You are provided with data about the first seven days of Louis Vuitton products after their launch on www.louisvuitton.com. Your goal is to forecast sales in each of the next three months separately (you have to make three predictions for one item). You can use product descriptions, sales, social media, website navigation, and image data to help Louis Vuitton predict and organize new product sales in the coming months. Sponsors with the support of: Organizers`'",,KaggleDays Paris,,KaggleDays Paris Jan 26th 2019 Competition,rmsle,kaggledays-paris 1043,'`Let's try to predict the price of a product based on a collection of over one hundred thousand reviews and other product features.`',,Kaggle days meetup Ariana - Tunisia,,1 day competition ,rmse,kaggle-days-meetup-ariana-tunisia 1044,"'` kaggle . raws & . Evaluation kay.keun@kakaocommerce.com .`'",,[] ,,,mape,[]---- 1045,"'` ? . 1994 . . , , 14 feature . . $50,000 1 $50,000 0 . . Overview : , , , Data : Notebooks : Discussion : , , Leaderboard : Rules : `'",,[T-Academy X KaKr] ,,,meanfscore,[t-academy-x-kakr]----- 1046,"'`Hello, and welcome to the application question for the 6th of our artificial intelligence trainings, the Entrepreneurship-Focused AI Training! In this competition, we expect you to predict a product's category (plant/animal/mix) by looking at values such as its calories, sugar and carbohydrates.
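For the multilabel protein task above, targets are typically encoded as fixed-length binary vectors (one slot per class 0-9), which is what macro F-score implementations expect. A minimal sketch, assuming a space-separated label string format, which is an assumption rather than something stated in the description:

```python
# Encode a multilabel target such as "2 4 6" as a 10-dim binary vector.
NUM_CLASSES = 10


def encode_labels(label_str):
    vec = [0] * NUM_CLASSES
    for token in label_str.split():
        vec[int(token)] = 1  # mark each class the image belongs to
    return vec


print(encode_labels("2 4 6"))  # [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]
```

A model then predicts one probability per slot, thresholded independently, rather than a single softmax class.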
We know this competition may be difficult for you. What matters as much as what you achieve is how much effort you put in. We expect you to do your best and to attach the resulting Jupyter Notebook (.ipynb) output to the application form. Note: to download the data, you need to join the competition via the link in the application form. Application Form Who is KaVe? We believe that complex systems and data science are very important topics, that these two topics must be treated together, and that their importance will keep growing in the future. We are researching today the increasingly complex, data-driven artificial intelligence technologies of tomorrow. Through the pioneering trainings and academic workshops we organize in Turkey, we strive to build our own industrial/academic team and to take our place among the people who produce technology and science. We also drink plenty of coffee. For more details, check out our social media accounts. Linkedin Medium Twitter`'",,KaVe Eitimi Kabul Hackathonu v2,,,categorizationaccuracy,kave-eitimi-kabul-hackathonu-v2 1047,"'`Introduction (KISTI) ""2020 "" . Academic , . Competition background RMS . 1912 4 15 , , 2,224 1,502 . , . . , , . , . Acknowledgement . (KISTI) `'",,KISTI Kaggle Competition (3RD),,,categorizationaccuracy,kisti-kaggle-competition-(3rd) 1048,"'`In this competition, you're presented with metadata on over 1000 past films from The Movie Database to try and predict their quality: are they good or bad? Data points provided include cast, keywords, budget, release dates, languages, production companies, countries, and so on. Use only the provided data. Solutions based on extra data will be banned. You are able to use only four ML techniques: Logistic Regression, Decision Tree, Random Forest and K-Nearest-Neighbors, or their simple ensembles: manual predict_proba averaging or a Voting Classifier. Other stacking, blending and boosting techniques are restricted and will be banned. Good luck, fellows!
`'",,KPIIT Movies Rating Prediction,,,categorizationaccuracy,kpiit-movies-rating-prediction 1049,"'`Logistics Performance Due to the recent COVID-19 pandemic across the globe, many individuals are increasingly turning to online platforms like Shopee to purchase their daily necessities. This surge in online orders has placed a strain on Shopee and our logistics providers, but customer expectations on the timely delivery of their goods remain high. On-time delivery is arguably one of the most important factors of success in the eCommerce industry, and now more than ever we need to ensure that orders reach our buyers on time in order to build our users' confidence in us. In order to handle the millions of parcels that need to be delivered every day, we have engaged multiple logistics providers across the region. Only the best logistics providers, those able to meet Shopee's delivery standards, are partnered with us. The performance of these providers is monitored regularly and each provider is held accountable based on the Service Level Agreements (SLA). Late deliveries are flagged and penalties are imposed on the providers to ensure they perform their utmost. This consistent monitoring and process of holding our logistics providers accountable allows us to maintain our promise of timely deliveries to our buyers. Task Identify all the orders that are considered late according to the Service Level Agreements (SLA) with our Logistics Provider. For the purpose of this question, assume that all deliveries are successful by the second attempt. Basic Concepts Each orderid represents a distinct transaction on Shopee. SLA can vary across each route (a route is defined as Seller's Location to Buyer's Location) - refer to SLA_matrix.xlsx. Pick Up Time is defined as the time when the 3PL picks up the parcel and begins to process it for delivery. It marks the start of the SLA calculation.
A Delivery Attempt is defined as an attempt made by the 3PL to deliver the parcel to the customer. It may or may not be delivered successfully. In the case where it is unsuccessful, a 2nd attempt will be made. A parcel that has no 2nd attempt is deemed to have been successfully delivered on the 1st attempt. All time formats are stored in epoch time based on Local Time (GMT+8). Only consider the date when determining if the order is late; ignore the time. Working Days are defined as Mon - Sat, excluding Public Holidays. SLA calculation begins from the next day after pickup (Day 0 = Day of Pickup; Day 1 = Next Day after Pickup). The 2nd Attempt must be no later than 3 working days after the 1st Attempt, regardless of the origin-to-destination route (Day 0 = Day of 1st Attempt; Day 1 = Next Day after 1st Attempt). Only consider the date when determining if the order is late; ignore the time. Assume the following Public Holidays: 2020-03-08 (Sunday); 2020-03-25 (Wednesday); 2020-03-30 (Monday); 2020-03-31 (Tuesday). Submission Format Check each delivery order and determine whether it is late. Two columns are required: orderid, and is_late (assign the value 1 if the order is late, otherwise 0). For example: orderid 1955512445, is_late 0; orderid 1955598428, is_late 1. Your submission should have 3,176,313 rows (excluding headers), each with 2 columns. Tips: 1) You are advised to run your tests on a sample of the dataset first. 2) If you are unable to solve the entire problem within the time limit, create the output csv with the required number of columns and rows based on a subset of the problem first.`'",,[Students] Shopee Code League - Logistics,,,matthewscorrelationcoefficient,[students]-shopee-code-league-logistics 1050,"'`LOGOS Speech Recognition Coding Challenge Imagine if you or a loved one could self-check brain function at any time using an app on your phone. This is becoming reality with the development of LOGOS, an automated telephone procedure designed to assess verbal memory.
But for this to reach its potential, single spoken word-to-text technology needs to improve to >95% accuracy. Natural-language speech-to-text technologies (e.g., Siri and the Google API) are excellent when they can mine context and usage, but their performance on single isolated words is dismal. This competition is therefore looking for innovative solutions to a discrete problem. Aim: To detect a set of 15 random words from within a .mp3 or .wav audio stream. This is your chance to revolutionise the way cognitive disorders like dementia are managed in the future, since early detection and intervention are key. Background and Significance Dementia is one of the great medical and social challenges of our times. An estimated 55 million people are already affected, and worldwide a new person is diagnosed every 3 seconds. In the UK, dementia has already become the leading cause of premature death. With the ageing of populations worldwide, this is set to become an even greater challenge. Memory impairment is central to the diagnosis of dementia, and traditionally this has involved paper-and-pencil tests administered by a psychologist. This gold standard produces very high quality data, but the service is expensive and can be difficult to access even in high-income countries like Australia (e.g., in rural and regional communities). LOGOS aims to help address this by running a 100% automated verbal memory test over a user's phone. We have validated LOGOS against gold-standard memory tests in a specialist setting, which requires an operator to listen to LOGOS audio files and manually transcribe them to text. This is clearly not feasible if LOGOS is to be deployed across thousands of people or potentially the whole population.
Therefore, for the full potential of LOGOS to be realised, an accurate automated method of speech-to-text transcription is required.`'",,LOGOS Speech Recognition Challenge (USYD),inClass,Coding challenge to build a speech recognition system for dementia research,mae,logos-speech-recognition-challenge-(usyd) 1051,"'` + `'",,+ ,,,categorizationaccuracy,+- 1052,"'`Description This is the competition for students who desire to join the medical image processing lab of Innopolis U. The task is to classify patches (crops) of lung regions without pathology (normal) and pathology regions (fibrosis). The patches are from different patients. The test set contains patches from patients different from those in the train set. The size of the patches is 64x64 px. Quick start Just fork or download the baseline kernel. It contains the whole pipeline, including loading the data, feature extraction, and preparing a submission file. Change whatever you want, be creative. Proposed solutions Feature extraction and supervised learning. The coarse pipeline of this approach is to manually extract features from the images, as was done in the baseline kernel for GLCM features. The next step is fitting a classifier. Neural networks. Instead of manual feature extraction, you can use different kinds of neural networks to perform the classification task. About results Please don't expect a very high score; the task is complex and the patches may be hardly separable. Please don't be shy to submit your predictions and kernels, we will take a look at your solution. The score is not as important as the way you think and solve tasks.`'",,Lungs patches classification,inClass,Image classification task on normal and pathology (fibrosis) lungs.,categorizationaccuracy,lungs-patches-classification 1053,'`Insert text here.
Please use markdown.`',,LyDzC4rN3f67W7z,inClass,[200227] Pingpong AI Research - 윤승원님,categorizationaccuracy,lydzc4rn3f67w7z 1054,"'`Classification of FIB/SEM biological images Challenge Focused ion beam scanning electron microscopy, or FIB/SEM, is an imaging modality for obtaining very high-resolution 3D images. It combines a focused ion beam (FIB), which mills away a thin layer at the surface of a sample, with a scanning electron beam (SEM), which images the face of the block. In the biological context, this modality makes it possible to image areas of 40 x 40 µm at an isotropic spatial resolution of 3 nm over a depth of several tens of micrometres, which offers the possibility of visualising many cellular objects. FIB/SEM imaging produces large images, on the order of 2048 x 1536 x 500. Manual annotation of the various biological objects contained in such an image is very long and tedious: it is therefore crucial to rely on automated techniques. Cross-section of a FIB/SEM image of a biological sample. Mitochondria are outlined in black. In red: mitochondria. In green: endoplasmic reticulum. In blue: nuclear membrane. In yellow: cell membrane. In this context, this challenge proposes to classify 2D patches of FIB/SEM images into 5 classes: {other, mitochondrion, endoplasmic reticulum, nuclear membrane, cell membrane}. More precisely, the task is to identify the class of the central pixel of the patch.
The 5 classes have the following identifiers: class 0: the central pixel belongs to no annotated class; class 1: the central pixel belongs to a mitochondrion; class 2: the central pixel belongs to an endoplasmic reticulum; class 3: the central pixel belongs to the nuclear membrane (the membrane delimiting the interior of the nucleus); class 4: the central pixel belongs to the cell membrane (the membrane delimiting the interior of the cell). Deliverables Kaggle notebook or Colab notebook: remember to grant me access; your code must be commented; a project report (integrated into the notebook) justifying the method used and the choices made. Acknowledgements We thank Patrick Schultz, head of the structural biology team at the IGBMC, for providing these images and agreeing to their distribution as part of this challenge.`'",,M1 I3D FSSD,inClass,Competition,categorizationaccuracy,m1-i3d-fssd 1055,"'`The challenge is to build a machine learning model to predict the type of skin cancer out of 5 different types: 1 - Melanoma (MEL), 2 - Melanocytic nevus (NV), 3 - Basal cell carcinoma (BCC), 4 - Actinic keratosis (AK), 5 - Benign keratosis (BKL)`'",,Skin Cancer Classification,inClass,The task is to differentiate between five different types of skin cancer.,categorizationaccuracy,skin-cancer-classification 1056,"'` 971 . , , - , , . : https://github.com/BorisLestsov/MADE/tree/master/contest1/unsupervised-landmarks-thousand-landmarks-contest : -5 (50). 6 , , 49 1 ; . , , , 1 . 0. , ( Notebooks). GridSearch , : ( , , , ) (EDA, , , ) , . ( - torchvision ). , , - ? ! Evaluation metric: MSE Acknowledgements We thank Daniel Lysukhin for providing this dataset.`'",,Thousand Facial Landmarks,,,mse,thousand-facial-landmarks 1057,"'`There's a reason why it's called March Madness. Upsets happen, underdogs become ""cinderellas,"" and games that analysts expected to be blowouts become nail-biters through the final seconds.
A team's competitiveness is what keeps games exciting and the tournament truly ""mad."" In addition to the predictive modeling competitions we typically host (NCAA Men's and Women's), we are hosting a separate competition using Kaggle Notebooks that challenges you to present an exploratory analysis of the Madness. Can you quantify competitiveness? Can you explain ""cinderellaness""? Or perhaps, can you determine what dictates the ability of a team to stay in the game and increase their chance to win late in the contest? This may or may not be a scalar metric. It might be a clustering of types of competitiveness and then a rating within each. Does this metric have predictive power? The interpretation is up to you. Your challenge is to tell a data story about college basketball through a combination of both narrative text and data exploration. A story could be defined any number of ways, and that's deliberate. You are to deeply explore (through data) the mania of the Men's and Women's NCAA College Basketball tournaments. That story can be examined in the macro (for example: How does competitiveness differ from the regular season to the tournament?) or the micro (for example: Does effectively neutralizing an opponent's star players increase the ability to stay in the game?). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!`'",,Google Cloud & NCAA March Madness Analytics,,,logloss,google-cloud-&-ncaa-march-madness-analytics 1058,"'`DeepLearning y=f(x) `'",,DeepLearning,,,auc,deeplearning 1059,"'`Sentiment analysis, also known as opinion mining, is a subfield within Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentimental polarities of the opinions it contains, e.g., positive or negative. In recent years, sentiment analysis has become a topic of great interest and development in both academia and industry.
Analysing the sentiment of texts could benefit, for example, customer service, product analytics, market research, etc. Take eBay as an example. Customers on eBay choose their preferred products based on the reviews from other users. An automatic sentiment classification system can not only help companies grasp the satisfaction level with their products, but also significantly assist new customers in locating their online shopping shelves. In this data analysis challenge, we are interested in developing such an automatic sentiment classification system that relies on machine learning techniques to learn from a large set of product reviews provided by Yelp. The levels of polarity of opinion we consider are **negative, neutral and positive**. For example, Website says open, Google says open, Yelp says open on Sundays. Our delivery was cancelled suddenly and no one is answering the phone. Shame gives us a negative sentiment, whereas the sentiment of They have great food & definitely excellent service. Tried their mochi mango flavored and it s definitely delis is likely to be positive. MDSS would like to thank Dr. Lan Du and Ming Liu for providing us the data set to host this competition. We would also like to thank our sponsors, the Faculty of IT, Monash University and National Australia Bank (NAB), for their support in making this event possible.`'",,MDSS Datathon,inClass,Monash Data Science Society (MDSS) datathon,categorizationaccuracy,mdss-datathon 1060,"'`Forecasting Task Students are to use the data provided in Columns B through J of the training and test datasets to generate a probabilistic forecast of acceptance/rejection for each of the 191 medical school applicants in the test dataset. Students need only submit their forecast (probability) of each candidate's acceptance into the school. Each forecast must be greater than zero and less than one. Students submitting one or more forecasts less than or equal to zero, or greater than one, will be automatically disqualified.
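Since every forecast must lie strictly between zero and one (and log loss explodes at hard 0s and 1s anyway), a common safeguard is to clip raw model outputs into the open interval before submitting. A minimal sketch; the eps margin is an arbitrary choice of ours, not something the rules specify:

```python
def clip_forecast(p, eps=1e-6):
    """Force a probability strictly inside (0, 1), as the rules require.

    eps is an arbitrary safety margin (our assumption, not from the rules).
    """
    return min(max(p, eps), 1.0 - eps)


# A raw model might emit hard 0s and 1s; clip each before submitting.
raw = [0.0, 0.17, 1.0, 0.5]
print([clip_forecast(p) for p in raw])  # 0.0 -> eps, 1.0 -> 1 - eps
```

Values already inside the interval, such as the 0.17 training base rate, pass through unchanged.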
Note that the base rate of acceptance varies between the training and test data sets. The base rate for the training data set is 0.17 (17%) and for the test data set is 0.21 (21%). This varying base rate may be taken into account when calculating individual forecasts. Privacy Issue These datasets were provided to our class with the understanding that they will not be shared outside of class. Students may not share the datasets with anybody outside the class.`'",,Medical School Admissions,inClass,Forecast probability of admission for each student given known base rate,logloss,medical-school-admissions 1061,"'`Business Objective Current evidence shows the advantages of medicine reviews by users for the safe and effective use of medicines. This data set provides the opinions of consumers about their conditions and the medicines that they have used. This product could help companies like 1mg provide detailed ratings of the side effects of products on their site. It could also help patients who are buying drugs online to check the side effects of the drugs before buying them.`'",,Medicine Side-Effects Analysis,,,categorizationaccuracy,medicine-side-effects-analysis 1062,"'` - Megogo - - . , . 2 : , 22:00 , 22:00 , 5 . . , 3 . . 2 , . , , 1 3, 3 , , . , . . (, , ) Megogo 7, 8 9 2018 : : - - - 10 -, 10 2018 . , watching_percentage >= 0.5. . Evaluation. Telegram Slack !`'",,Megogo Challenge,,,map@{k},megogo-challenge 1063,"'`Melinia 2020 - Datathon Competition Description A data scientist is someone who can obtain, scrub, explore, model, and interpret data, blending hacking, statistics, and machine learning.
Melinia 2020 presents Datathon, an event for data scientists in which participants will build a machine learning model on the data provided for a specific problem statement, evaluated against the accuracy metric. Nourishment Quality Appraisal The food review division conducts regular assessments of food quality for various eateries within the city. It is a very well-documented procedure, and over time a substantial amount of data has been generated from these reviews. The review office would like to predict where it should focus most in its next review plan, so that it can best use the time at hand to catch the worst offenders. Can the past assessments, or any other information that they have collected, predict which assessments will pass or fail? The data covers food quality checks conducted on facilities that serve food across numerous cities. Your objective is to predict whether an appraisal will pass or fail the assessment based on a number of factors.`'",,Nourishment Quality Appraisal,,,categorizationaccuracy,nourishment-quality-appraisal 1064,"'`Google Cloud and the NCAA have teamed up to bring you this year's version of the Kaggle machine learning competition. Another year, another chance to anticipate the upsets, call the probabilities, and put your bracketology skills to the leaderboard test. Kagglers will join the millions of fans who attempt to forecast the outcomes of March Madness during this year's NCAA Division I Men's and Women's Basketball Championships. But unlike most fans, you will pick your bracket using a combination of the NCAA's historical data and your computing power, while the ground truth unfolds on national television. In the first stage of the competition, Kagglers will rely on results of past tournaments to build and test models. We encourage you to post any useful external data as a dataset.
In the second stage, competitors will forecast outcomes of all possible match-ups in the 2018 NCAA Division I Men's and Women's Basketball Championships. You don't need to participate in the first stage to enter the second. The first stage exists to incentivize model building and provide a means to score predictions. The real competition is forecasting the 2018 results. This page is for the NCAA Division I Men's tournament. Check out the NCAA Division I Women's tournament here.`'",,Google Cloud & NCAA ML Competition 2018-Men's,,,logloss,google-cloud-&-ncaa-ml-competition-2018-mens 1065,"'`As a result of the continued collaboration between Google Cloud and the NCAA, the sixth annual Kaggle-backed March Madness competition is underway! Another year, another chance to anticipate the upsets, call the probabilities, and put your bracketology skills to the leaderboard test. Kagglers will join the millions of fans who attempt to forecast the outcomes of March Madness during this year's NCAA Division I Men's and Women's Basketball Championships. But unlike most fans, you will pick your bracket using a combination of the NCAA's historical data and your computing power, while the ground truth unfolds on national television. In the first stage of the competition, Kagglers will rely on results of past tournaments to build and test models. We encourage you to post any useful external data as a dataset. In the second stage, competitors will forecast outcomes of all possible matchups in the 2019 NCAA Division I Men's and Women's Basketball Championships. You don't need to participate in the first stage to enter the second. The first stage exists to incentivize model building and provide a means to score predictions. The real competition is forecasting the 2019 results. As the official public cloud provider of the NCAA, Google is proud to provide a competition to help participants strengthen their knowledge of basketball, statistics, data modeling, and cloud technology.
As part of its journey to the cloud, the NCAA has migrated 80+ years of historical and play-by-play data, from 90 championships and 24 sports, to Google Cloud Platform (GCP). The NCAA has tapped into decades of historical basketball data using BigQuery, Cloud Spanner, Datalab, Cloud Machine Learning and Cloud Dataflow, to power the analysis of team and player performance. The mission of the NCAA has long been about serving the needs of schools, their teams and students. Google Cloud is proud to support that mission by helping the NCAA use data and machine learning to better engage with its millions of fans, 500,000 student-athletes, and more than 19,000 teams. Game on! This page is for the NCAA Division I Men's tournament. Check out the NCAA Division I Women's tournament here.`'",,Google Cloud & NCAA ML Competition 2019-Men's,,,logloss,google-cloud-&-ncaa-ml-competition-2019-mens 1066,"'` , (, , ). . . , - - . , ! - accuracy`'",,Mephi-ApplPy-classif_test1,inClass,Первое соревнование. Классификация объявлений на K классов,categorizationaccuracy,mephi-applpy-classif_test1 1067,"'`Kaggle is a platform for data science competitions. As a methods club, we don't really need a competition, but it is helpful to have a standardized dataset that we're all using and a pipeline in place for us to build lots of networks and easily check how we're doing. Kaggle allows us to do those things. The files in the ""Data"" section contain the CIFAR-10 image dataset and a very basic example of a network that learns to classify these images. I recommend downloading all the files in that section and running the Jupyter notebook ""CIFAR-10 Classification.ipynb"". That will generate a file called ""predictions.csv"" that you can try submitting to Kaggle. Once you've successfully made a submission, you can swap the example network out for your own! I recommend we try to start making submissions early and often (even if the network performance is horrendous). 
You can make up to 20 submissions a day and the leaderboard will only reflect the performance of your best network to-date. Let's build some networks!`'",,Methods Club: Classifying CIFAR-10,inClass,A simple image dataset to practice building neural networks for classification.,categorizationaccuracy,methods-club:-classifying-cifar-10 1068,"'`The Sinking of the SS Leviatán The sinking of the SS Leviatán is one of the most famous sinkings in history. On May 15, 1894, during one of its voyages, the Leviatán sank after colliding with a whale, leaving 1,502 dead out of the 2,224 people aboard, counting passengers and crew. From that moment on, this tragedy led to stricter ship safety measures. One of the reasons the sinking led to so many deaths is that there were not enough lifeboats for the passengers and crew. Although survival depended, to some extent, on luck, some groups of people had a better chance of surviving than others, such as women, children, and the upper class. In this competition we ask you to analyze and try to predict the probability of survival, given the variables we provide.`'",,El hundimiento del SS Leviatan,inClass,Can you predict who was more likely to survive?,auc,el-hundimiento-del-ss-leviatan 1069,"'`General Instructions This homework is easy and will get you started on tools for network analysis. If you find these questions difficult, then you might not be ready for this course. Submission instructions: Prepare answers to your homework in a single PDF file and submit it via GradeScope. Make sure that the answer to each sub-question is on a separate, single page. The number of the question should be at the top of each page. Use the submission template files included in the bundle to prepare your submission.
Fill out the information sheet located at the end of the submission template file, and sign it in order to acknowledge the Honor Code (if typesetting the homework, you may type your name instead of signing). This should be the last page of your submission. Failure to fill out the information sheet will result in a reduction of 2 points from your homework score. Submitting code: Students also need to upload their code on Gradescope. Put all the code for a specific question into a single compressed file and upload it. Questions The purpose of these exercises is to get you started with network analysis and the SNAP software. For this homework, you need to install and try out the SNAP network analysis tools. We strongly encourage you to use Snap.py for Python (available from http://snap.stanford.edu/snappy/). However, you can also use SNAP for C++ (available from http://snap.stanford.edu/snap/download.html) 1 Analyzing the Wikipedia voters network [27 points] Download the Wikipedia voting network wiki-Vote.txt.gz: http://snap.stanford.edu/data/wiki-Vote.html. Using one of the network analysis tools above, load the Wikipedia voting network. Note that Wikipedia is a directed network. Formally, we consider the Wikipedia network as a directed graph G = (V, E), with node set V and edge set E ⊆ V × V (edges are ordered pairs of nodes). An edge (a, b) ∈ E means that user a voted on user b. To make our questions clearer, we will use the following small graph as a running example: G_small = (V_small, E_small), where V_small = {1, 2, 3} and E_small = {(1, 2), (2, 1), (1, 3), (1, 1)}. Compute and print out the following statistics for the wiki-Vote network: The number of nodes in the network. (G_small has 3 nodes.) The number of nodes with a self-edge (self-loop), i.e., the number of nodes a ∈ V where (a, a) ∈ E. (G_small has 1 self-edge.) The number of directed edges in the network, i.e., the number of ordered pairs (a, b) ∈ E for which a ≠ b. (G_small has 3 directed edges.)
The number of undirected edges in the network, i.e., the number of unique unordered pairs (a, b), a ≠ b, for which (a, b) ∈ E or (b, a) ∈ E (or both). If both (a, b) and (b, a) are edges, this counts a single undirected edge. (G_small has 2 undirected edges.) The number of reciprocated edges in the network, i.e., the number of unique unordered pairs of nodes (a, b), a ≠ b, for which (a, b) ∈ E and (b, a) ∈ E. (G_small has 1 reciprocated edge.) The number of nodes of zero out-degree. (G_small has 1 node with zero out-degree.) The number of nodes of zero in-degree. (G_small has 0 nodes with zero in-degree.) The number of nodes with more than 10 outgoing edges (out-degree > 10). The number of nodes with fewer than 10 incoming edges (in-degree < 10). Each sub-question is worth 3 points. 2 Further Analyzing the Wikipedia voters network [33 points] For this problem, we use the Wikipedia voters network. If you are using Python, you might want to use the NumPy, SciPy, and/or Matplotlib libraries. (18 points) Plot the distribution of out-degrees of nodes in the network on a log-log scale. Each data point is a pair (x, y) where x is a positive integer and y is the number of nodes in the network with out-degree equal to x. Restrict the range of x between the minimum and maximum out-degrees. You may filter out data points with a 0 entry. For the log-log scale, use base 10 for both x and y axes. (15 points) Compute and plot the least-squares regression line for the out-degree distribution in the log-log scale plot. Note we want to find coefficients a and b such that the function log10(y) = a·log10(x) + b, equivalently, y = 10^b · x^a, best fits the out-degree distribution. What are the coefficients a and b? For this part, you might want to use the method called polyfit in NumPy with the deg parameter equal to 1. 3 Finding Experts on the Java Programming Language on StackOverflow [40 points] Download the StackOverflow network stackoverflow-Java.txt.gz: http://snap.stanford.
edu/class/cs224w-data/hw0/stackoverflow-Java.txt.gz. An edge (a, b) in the network means that person a endorsed an answer from person b on a Java-related question. Using one of the network analysis tools above, load the StackOverflow network. Note that StackOverflow is a directed network. Compute and print out the following statistics for the stackoverflow-Java network: The number of weakly connected components in the network. This value can be calculated in Snap.py via the function GetWccs. The number of edges and the number of nodes in the largest weakly connected component. The largest weakly connected component is calculated in Snap.py with the function GetMxWcc. IDs of the top 3 most central nodes in the network by PageRank scores. PageRank scores are calculated in Snap.py with the function GetPageRank. IDs of the top 3 hubs and top 3 authorities in the network by HITS scores. HITS scores are calculated in Snap.py with the function GetHits. Each sub-question is worth 10 points. You can find more details about this exercise on the Snap.py tutorial page: http://snap.stanford.edu/proj/snap-icwsm/. As an extra exercise, extend the tutorial to find experts in other programming languages or topics.`'",,ML in Graphs: HW0,inClass,ML in Graphs: HW0,auc,ml-in-graphs:-hw0 1070,"'`General Instructions These questions require thought, but do not require long answers. Please be as concise as possible. You are allowed to take a maximum of 1 late period (see the information sheet at the end of this document for the definition of a late period). Submission instructions: You should submit your answers and code via Gradescope. There will be a separate submission assignment for written and code. Submitting answers: Prepare answers to your homework in a single PDF file and submit it via Gradescope to the HW1 Written assignment. Make sure that the answer to each sub-question is on a separate, single page. The number of the question should be at the top of each page.
Please use the submission template files included in the bundle to prepare your submission. Failure to use the submission template file will result in a reduction of 2 points from your homework score. Information sheet: Fill out the information sheet located at the end of the submission template file, and sign it in order to acknowledge the Honor Code (if typesetting the homework, you may type your name instead of signing). This should be the last page of your submission. Failure to fill out the information sheet will result in a reduction of 2 points from your homework score. Submitting code: Upload your code on Gradescope to the HW1 Code assignment. Put the code for each question in a separate single Python file named q<question number>.py. For example, all code for all parts of question 1 should go in file q1.py. Then upload a zip file containing all the Python files via Gradescope to the HW1 Code assignment. Failure to submit your code will result in a reduction of all points for that question from your homework score. Homework survey: After submitting your homework, please fill out the Homework 1 Feedback Form. Respondents will be awarded extra credit. Questions 1 Network Characteristics [25 points] One of the goals of network analysis is to find mathematical models that characterize real-world networks and that can then be used to generate new networks with similar properties. In this problem, we will explore two famous models (Erdős–Rényi and Small World) and compare them to real-world data from an academic collaboration network. Note that in this problem all networks are undirected. You may use the starter code in hw1-q1-starter.py for this problem. Erdős–Rényi Random Graph (G(n, m) random network): Generate a random instance of this model by using n = 5242 nodes and picking m = 14484 edges at random. Write code to construct instances of this model, i.e., do not call a SNAP function.
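The G(n, m) construction described above can be written without SNAP; a minimal pure-Python sketch, assuming the undirected graph is stored as a set of frozenset edge pairs (the helper name gnm_random_graph is ours, not part of the assignment's starter code):

```python
import random

def gnm_random_graph(n, m, seed=None):
    """Sample m distinct undirected edges uniformly among n nodes (no self-loops)."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b:
            edges.add(frozenset((a, b)))  # frozenset makes (a, b) and (b, a) the same edge
    return edges

edges = gnm_random_graph(5242, 14484, seed=42)
print(len(edges))  # 14484
```

Rejection sampling of duplicate edges is fine here because m is tiny relative to the n·(n−1)/2 possible edges.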
Small-World Random Network: Generate an instance from this model as follows: begin with n = 5242 nodes arranged as a ring, i.e., imagine the nodes form a circle and each node is connected to its two direct neighbors (e.g., node 399 is connected to nodes 398 and 400), giving us 5242 edges. Next, connect each node to the neighbors of its neighbors (e.g., node 399 is also connected to nodes 397 and 401). This gives us another 5242 edges. Finally, randomly select 4000 pairs of nodes not yet connected and add an edge between them. In total, this will make m = 5242 · 2 + 4000 = 14484 edges. Write code to construct instances of this model, i.e., do not call a SNAP function. Real-World Collaboration Network: Download this undirected network from http://snap.stanford.edu/data/ca-GrQc.txt.gz. Nodes in this network represent authors of research papers on the arXiv in the General Relativity and Quantum Cosmology section. There is an edge between two authors if they have co-authored at least one paper together. Note that some edges may appear twice in the data, once for each direction. Ignoring repeats and self-edges, there are 5242 nodes and 14484 edges. (Note: repeats are automatically ignored when loading an undirected graph with SNAP's LoadEdgeList function.) 1.1 Degree Distribution [12 points] Generate a random graph from both the Erdős–Rényi (i.e., G(n, m)) and Small-World models and read in the collaboration network. Delete all of the self-edges in the collaboration network (there should be 14,484 total edges remaining). Plot the degree distribution of all three networks in the same plot on a log-log scale. In other words, generate a plot with the horizontal axis representing node degrees and the vertical axis representing the proportion of nodes with a given degree (by log-log scale we mean that both the horizontal and vertical axes must be in logarithmic scale). In one to two sentences, describe one key difference between the degree distribution of the collaboration network and the degree distributions of the random graph models.
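The degree-distribution points for the plot above can be gathered with a simple counter before any plotting; a minimal sketch, assuming the undirected graph is given as a list of edge pairs (the function name and toy graph are illustrative, not from the assignment):

```python
from collections import Counter

def degree_distribution(edges):
    """Map degree x -> number of nodes y with that degree, from an undirected edge list."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return Counter(deg.values())

# Toy graph: a triangle plus one pendant node; degrees are 2, 2, 3, 1
dist = degree_distribution([(0, 1), (1, 2), (2, 0), (2, 3)])
print(dist)  # two nodes of degree 2, one node of degree 3, one node of degree 1
```

For the log-log plot, one would then take the (x, y) pairs from this counter and plot log10 of both coordinates (isolated nodes of degree 0 never appear in the counter, matching the "filter out 0 entries" note).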
1.2 Clustering Coefficient [13 points] Recall that the local clustering coefficient for a node v_i was defined in class as \[ C_{i}=\left\{\begin{matrix} \frac{2\left | e_{i} \right |}{k_{i}\cdot (k_{i}-1)} & k_{i}\geq 2 \\ 0 & \text{otherwise,} \end{matrix}\right. \] where k_i is the degree of node v_i and e_i is the number of edges between the neighbors of v_i. The average clustering coefficient is defined as \[ C = \frac{1}{\left | V \right |}\sum_{i\in V}C_{i}\] Compute and report the average clustering coefficient of the three networks. For this question, write your own implementation to compute the clustering coefficient, instead of using a built-in SNAP function. Which network has the largest clustering coefficient? In one to two sentences, explain. Think about the underlying process that generated the network. What to submit Page 1: Log-log degree distribution plot for all three networks (in the same plot). One to two sentences describing a difference between the collaboration network's degree distribution and the degree distributions from the random graph models. Page 2: Average clustering coefficient for each network. Network that has the largest average clustering coefficient. One to two sentences explaining why this network has the largest average clustering coefficient. 2 Structural Roles: RolX and ReFeX [25 points] In this problem, we will explore the structural role extraction algorithm RolX and its recursive feature extraction method ReFeX. As part of this exploration, we will work with a dataset representing a scientist co-authorship network, which can be downloaded at http://www-personal.umich.edu/~mejn/netdata/netscience.zip. Although the graph is weighted, for simplicity we treat it as undirected and unweighted in this problem. Feature extraction consists of two steps; we first extract basic local features from every node, and we subsequently aggregate them along graph edges so that global features are also obtained.
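The local clustering coefficient C_i from Section 1.2 above can be implemented directly from an adjacency-set representation; a minimal pure-Python sketch (the adjacency dict and function names are illustrative, not part of the provided starter code):

```python
def local_clustering(adj, i):
    """C_i = 2*|e_i| / (k_i*(k_i-1)), where e_i counts edges among the neighbors of i."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # count each neighbor pair (u, v) once, via u < v
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Average of C_i over all nodes of the graph."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

# Toy graph: triangle {0,1,2} plus pendant node 3; C = (1 + 1 + 1/3 + 0) / 4
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_clustering(adj))  # ≈ 0.5833
```

On the assignment's networks the same functions apply once each edge list is loaded into such an adjacency dict.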
Collectively, feature extraction constructs a matrix V ∈ R^(n×f) where for each of the n nodes we have f features to cover local and global information. RolX extracts node roles from that matrix. 2.1 Basic Features [5 points] We begin by loading the graph G provided in the bundle and computing three basic features for the nodes. For each node v, we choose 3 basic local features (in this order): the degree of v, i.e., deg(v); the number of edges in the egonet of v, where the egonet of v is defined as the subgraph of G induced by v and its neighborhood; the number of edges that connect the egonet of v and the rest of the graph, i.e., the number of edges that enter or leave the egonet of v. We use \[ \widetilde{V}_{u}\] to represent the vector of the basic features of node u. For any pair of nodes u and v, we can use cosine similarity to measure how similar two nodes are according to their feature vectors x and y: \[ Sim(x,y)=\frac{x\cdot y}{\left \| x \right \|_{2}\cdot\left \| y \right \|_{2}}=\frac{\sum_{i}x_{i}y_{i}}{\sqrt{\sum _{i}x_{i}^{2}}\cdot \sqrt{\sum _{i}y_{i}^{2}}}\] Also, when \[ \left \| x \right \|_{2} = 0 \] or \[ \left \| y \right \|_{2} = 0 \], we define \[ Sim(x, y) = 0 \]. Compute the basic feature vector for the node with ID 9, and report the top 5 nodes that are most similar to node 9 (excluding node 9). As a sanity check, no element in \[ \widetilde{V}_{9}\] is larger than 10. 2.2 Recursive Features [8 points] In this next step, we recursively generate some more features. We use mean and sum as aggregation functions.`'",,ML in Graphs: HW1,inClass,ML in Graphs: HW1,auc,ml-in-graphs:-hw1 1071,"'`Welcome to the Speech Recognition Challenge! An audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech.`'",,ML Project - Speech Recognition Challenge,inClass,Implement a Speech Recognition System,categorizationaccuracy,ml-project-speech-recognition-challenge 1072,"'` Overview O2OOnline to OfflineO2O 10 O2O APP O2O Data 20161120165312016615 Evaluation AUCROC User_id - Date_received - Coupon_id AUC`'",,ml100marathon-02-01,inClass,Midterm exam for ML 100 marathon by Cupoy,auc,ml100marathon-02-01 1073,'`Based on https://www.kaggle.com/c/whats-cooking.`',,ML1819 - What's Cooking?,inClass,Use recipe ingredients to categorize the cuisine,categorizationaccuracy,ml1819-whats-cooking? 1074,"'`ML2020 UIBK Welcome to the submission page and leader board for the project in ML2020. You may need to join this competition first. Guideline: You will find the example submission file in the Data tab. Please generate the Submission.csv in your own machine and upload it via the submission button on the top-right of the page. Kaggle also provides a free jupyter-notebook service. The submissions may be a bit of difference. If you want to use Kaggle, please refer to the submission details to the Kaggle tutorial.`'",,ML2020 UIBK,,,macrofscore,ml2020-uibk 1075,"'`Wine (from Latin vinum) is an alcoholic beverage made from grapes, generally Vitis vinifera, fermented without the addition of sugars, acids, enzymes, water, or other nutrients. Wine has been produced for thousands of years. The earliest known traces of wine are from Georgia (cca. 6000 BC), Iran (cca. 5000 BC), and Sicily (cca. 4000 BC) although there is evidence of a similar alcoholic beverage being consumed earlier in China (cca. 7000 BC). The earliest known winery is the 6,100-year-old Areni-1 winery in Armenia. 
Wine reached the Balkans by 4500 BC and was consumed and celebrated in ancient Greece, Thrace and Rome. Throughout history, wine has been consumed for its intoxicating effects. Wine has long played an important role in religion. Red wine was associated with blood by the ancient Egyptians and was used by both the Greek cult of Dionysus and the Romans in their Bacchanalia; Judaism also incorporates it in the Kiddush and Christianity in the Eucharist. Yeast consumes the sugar in the grapes and converts it to ethanol and carbon dioxide. Different varieties of grapes and strains of yeasts produce different styles of wine. These variations result from the complex interactions between the biochemical development of the grape, the reactions involved in fermentation, the terroir, and the production process. Many countries enact legal appellations intended to define styles and qualities of wine. These typically restrict the geographical origin and permitted varieties of grapes, as well as other aspects of wine production. Wines not made from grapes include rice wine and fruit wines such as plum, cherry, pomegranate and elderberry. This work addresses the following issues concerning the quality of wine with respect to its various chemical contents and acids. The first issue concerns the correlation between different acids and the quality of wine. In this work, we investigate which parameter is most strongly associated with the best-quality wine. Overview of the Study Our field study concerns the quality of wine produced all over the world. Wine belongs to the alcohol family and is considered part of a rich culture as well. There are various health benefits of wine (http://www.wideopeneats.com/10-health-benefits-get-drinking-daily-glass-wine/).
There are a large number of occupations and professions that are part of the wine industry, ranging from the individuals who grow the grapes, prepare the wine, bottle it, sell it, assess it, market it and finally make recommendations to clients and serve the wine. In this study, we identify the important chemical components of wine that are correlated with its quality, and the important acids associated with that quality. We will find out the right components, needed in the right ratios, to make a quality wine.`'",,MLDM Classification Competition 2020,,,meanfscore,mldm-classification-competition-2020 1076,"'`In this competition, you will work on a knowyourmeme.com meme dataset, and predict a meme's popularity via the number of views it gets!`'",,ML@DUB Competition #1,,,rmse,ml@dub-competition-#1 1077,"'`In this Challenge, you need to predict the price range of mobile phones given their specs like RAM, camera megapixels, battery capacity, etc. Try different classification models. Use cross-validation strategies. Tune your model's hyper-parameters. Submit your predictions. Note: You can make multiple submissions and the best submission will be taken into consideration.`'",,Mobile Price Range Prediction IS2020,,,categorizationaccuracy,mobile-price-range-prediction-is2020 1078,'`This dataset is derived from the customer reviews on the Amazon Commerce Website for authorship identification. Most previous studies conducted the identification experiments for two to ten authors. We identified 50 of the most active users (represented by a unique ID and username) who frequently posted reviews in these newsgroups. The number of reviews we collected for each author is 30.`',,MSE-BB-2-SS2020-MLE-Amazon Reviews,,,categorizationaccuracy,mse-bb-2-ss2020-mle-amazon-reviews 1079,"'`This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA.
The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).`'",,MSE-BB-2-SS2020-MLE-Congressional Voting,,,categorizationaccuracy,mse-bb-2-ss2020-mle-congressional-voting 1080,"'` "" "". https://github.com/Dyakonov/IML .`'",, 2 ,,,auc,--2--- 1081,"'` . , Kaggle, . Kaggle . . . , , . , . , . : Overview - . Data - , . Kernels - Kaggle. , . Leaderboard - . Private Public. 50% ( , ). (Private Leaderboard) , . . , . Rules - Team - .`'",, . ,,,categorizationaccuracy,-.- 1082,"'`In this competition you are given images in the file test.csv, which you are to classify. Each row of the CSV file contains (comma-separated) the grayscale values (uint8) of the pixels. The following code should give you a good start: import pandas as pd import matplotlib.pyplot as plt %matplotlib inline df_train = pd.read_csv('train.csv',header=None) idx = df_train.iloc[:,0] target = df_train.iloc[:,1] pixels = df_train.iloc[:,2:] side = int(pixels.shape[1]**0.5) plt.imshow(pixels.iloc[50].values.reshape(side, side)) (Note: .iloc[50] selects a row, whereas pixels[50] would select a column; the reshape assumes square images.) The images show the letters A-H. Create a text file submission.csv with your predictions in the following format: Id,target 0,E 1,H ...
9363,A A file with the correct format can be created as follows: import pandas as pd df_test = pd.read_csv('test.csv',header=None,index_col=0) ser = pd.Series([0]*len(df_test),name='target'); classes = pd.Series(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']) random_prediction = ser.map(lambda x:classes.sample().values[0]) random_prediction.index.name='Id' random_prediction.to_csv('Kaggle_uploads/Random_Sample_Submission.csv',header=True) Upload this file to Kaggle after replacing the random class labels with your predictions.`'",,Nicht MNIST!,,,categorizationaccuracy,nicht-mnist! 1083,"'`KBTG-CU Spelling Error Detection The joint collaboration research project between KBTG and Chulalongkorn University provides misspelling data from social media text. The position of the misspelled words in a sentence is given. Your job is to create a misspelled-word detection model using the data provided. Any kinds of models/methods are welcome. However, you are required to provide a writeup for the methods you use, so make sure you can explain your solution. You are not allowed to use the test set in any way besides for evaluation. For example, you cannot use the test set to learn the dictionary. This data is provided only for educational purposes.`'",,KBTG-CU Spelling Error Detection,inClass,Thai Spelling Error Detection,normalizedgini,kbtg-cu-spelling-error-detection 1084,"'`Regression task to predict a value in [0,3] determining the degree of offence in the sentences. The required output is a double value ranging from 0 to 3 -> [0,3]. Use nlp_train.csv to train the model, and _nlp_test.csv to submit predictions. The sample submission file is the same as _nlp_test.csv; the submission to be made has columns: {tweet, offensive _ language}
Class = 1 corresponds to the sentences being paraphrases of each other. The train set contains 16000 samples and test set contains 4700 samples.`'",,NNFL Lab-4,,,categorizationaccuracy,nnfl-lab-4 1086,"'`WINNERS! - September 1 We are excited to announce the winners! Please follow the discussion thread for more detail of the evaluation. Please also leave your feedback HERE for possible next AI Challenge and the other tech-focused events. Diamond AI Award (the 1st place in the ranking) RMSE 0.03212 Peter Zhou - CTO, NTT DATA Cloud Greater China Ruby AI Award (the 2nd place in the ranking) DualShock Team: RMSE 0.04227 Dhruv Arora - Software Development Senior Associate, NTT Data Services Yatish Goel - Software Development Specialist Advisor, NTT Data Services Leo Wang - Systems Integration Advisor, NTT Data Services Emerald AI Award (the 3rd place in the ranking) RMSE: 0.05410 Makoto Tachibana - AI Platform Engineer, AI and IoT, NTT DATA Corporation Golden Notebook Award (The most referred notebook during the competition period) Story Telling - Animated EDA + Beginner's Model: 37 Non-Novice Votes Aravind Rajaelangovan - Software Development Senior Associate, NTT DATA Services Jegath Shebin XP - Software Development Senior Associate, NTT DATA Services Silver Notebook Award (The 2nd most referred notebook during the competition period) Satellite Image pre-process (Tif & Npy): 33 Non-Novice Votes Magesh T - Business Intelligence Analyst, NTT DATA Global Analytics, Business Process Services (BPO), NTT DATA Services ARIMA Model code & guide for submission: 33 Non-Novice Votes Ramprassath T C - Business Analysis Advisor, NTT DATA IPS Ltd. 
AR, MA, ARMA, ARIMA, Prophet, LSTM, RegModels: 33 Non-Novice Votes Masahiro Sajiki - Data Scientist, Digital Technology and Engineering Department, NTT DATA Corporation MESSAGE Head of Research and Development Headquarters Shunichi Amemiya To update the technological skills of employees and encourage positive contributions toward a good society, the NTT DATA GLOBAL AI Challenge gives you an opportunity to tackle challenging AI problems with the latest technology and collaborate with AI engineers beyond borders. Today, the novel coronavirus is threatening the entire world and damaging our lives and economies. To overcome this global threat together, we have selected this threat as the main topic and are asking participants to create and optimize AI models to solve COVID-19-related problems. The competition takes place on Kaggle.com, a well-known platform for data scientists to explore, publish, and use data and AI models. You can see that many exciting competitions are already running on Kaggle. We believe that this challenge gives you a chance to show your AI capability to our colleagues globally and to jump into this community. The Challenge The ongoing COVID-19 pandemic has caused worldwide disruption to society and to businesses globally. We want you to investigate how the coronavirus pandemic has changed oil markets. Reducing the economic impact of the coronavirus has become a serious social challenge, as has the medical response. And if we can use satellite imagery to predict impacts, we can discover areas that are particularly in need of a response yet overlooked by the world. This time, we have set the price of oil as an indicator of the economic situation. Oil prices are known to be highly correlated with economic conditions, which we believe makes them a valid indicator of global economic conditions. In this challenge, we ask you to build a predictive model that answers the question: how is COVID-19 affecting oil markets?
using NTL (Night-Time Lights) satellite images of four countries and COVID-19 data (i.e., infections, economic indicators, etc.). Technical challenges include handling satellite images, applying image recognition and machine learning to satellite images, building machine learning into predictive models, tuning machine learning models to refine predictions, and incorporating external data. Because of the use of satellite imagery, the solutions built here can be put to immediate use around the world. It is also expected to lead to projects such as one with the World Bank and the Asian Development Bank to analyze the effects of the coronavirus. The Target of the Competition The participants will be asked to predict the 75-day moving averages of WTI Oil Prices between July 16 and August 21, 2020, the future period after the competition. The challenge is future price prediction at a certain time with any data available at that point. The prediction target The 75-day moving average of WTI Oil Prices: from July 16 to August 21, 2020. The data for the prediction We are planning to provide the data below. We will regularly update the available data on the timeline described in the next section. The 75-day moving average of WTI Oil Prices: from August 16, 1986, to July 6, 2020 The daily Night-Time Light (NTL) satellite images: from January 1 to July 6, 2020 The daily COVID-19 data by country: from January 1 to July 6, 2020 We do not place any restrictions on using extra data. The participants can add or ignore any given data, even the satellite images, if they want. But we will ask participants to make open any extra data they utilize on Discussion / Notebooks or when publishing their Kernels.
The data update timeline Friday, June 12: NTL satellite images from January 1 to June 8, 2020 COVID-19 case numbers from January 1 to June 8, 2020 The 75-day moving average of WTI crude oil prices from April 18, 1986 to June 8, 2020 Monday, June 22: All data will be updated with available NTL satellite images, COVID-19 and WTI prices data until June 15, 2020 Monday, June 29: All data will be updated with available NTL satellite images, COVID-19 and WTI prices data until June 22, 2020 Friday, July 3: All data will be updated with available NTL satellite images, COVID-19 and WTI prices data until June 29, 2020 Friday, July 10: All data will be updated with available NTL satellite images, COVID-19 and WTI prices data until July 6, 2020 The timeline might be slightly adjusted due to the Org Team's working progress and the available data. The evaluation criteria and the leaderboard The evaluation criterion (hereafter also referred to as the Score) is the Root Mean Squared Error between the actual 75-day average of the WTI price and the predicted value on each date during the target period. The Leaderboard will be calculated with the latest available WTI data until Day X. Day X will be June 8, 15, 22, 29 or July 6, 2020, respectively, per the above data update timeline. We need some days for the recalculations when changing Day X (e.g. from June 8 to 19) due to the manual work by the Kaggle Operation Team. We will also provide all available data, including the actual 75-day average, until Day X above. You might upload those ""answers"" as your submissions and easily get the best score (0.0) on the Public Leaderboard. But please do not submit those answers; submit your prediction results instead, so that the participants can compare their model performance with each other. However, the leaderboard itself can show only a referral score against all available 75-day averages of the WTI Prices between April 29 and Day X, not the target period.
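The Root Mean Squared Error used for scoring above is simple to compute locally before submitting; a minimal sketch (the price values below are made up purely for illustration):

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    assert len(actual) == len(predicted)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical 75-day moving-average prices vs. a model's predictions
actual = [40.1, 40.3, 40.6, 40.8]
predicted = [40.0, 40.4, 40.5, 41.0]
print(round(rmse(actual, predicted), 5))  # 0.13229
```

Scoring a held-out slice of the provided history this way gives a rough offline estimate of leaderboard performance.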
Participants can use this Public Leaderboard only to verify the submission format and to check the performance of their algorithms and their improvements. The actual winner will be decided by the two final submissions and the actual future WTI averages for the period between July 16 and August 21, 2020, after the competition. The winner will be announced on this Overview page, in Discussion, and on the landing page. We may also use the Private Leaderboard to show the rankings if we can recalculate the scores after the target period.`'",,Global AI Challenge 2020,,,rmse,global-ai-challenge-2020 1087,"'`This is a competition for participants of the Open Machine Learning Course by ODS in Dubai. The goal is to predict whether income exceeds $50K/yr based on census data. The target variable is target, which contains 1 if income exceeds $50K/yr and 0 otherwise. For this task you can extract any features you want, but you are limited in your choice of ML model: you can use only Decision Trees and kNN. You should write your solutions only in Kaggle kernels. You can keep them private until the end of the competition, but if you want you can make any notebook public before the end. Cheating = ban from ODS. If you cannot show (publish) a notebook that replicates your score after the end of the course, you will be excluded from the leaderboard. If you want to get additional credits on the course, you may want to write an article using Kaggle Kernels about the LogLoss function. Page of the course`'",,"MLClass Dubai by ODS, Lecture 3 HW",,,logloss,"mlclass-dubai-by-ods,-lecture-3-hw" 1088,"'`At Shopee, we strive to ensure fairness to both buyers and sellers, and improve user experience by identifying and discouraging negative behaviour. Listing quality is a major area where poor behaviours often occur. Every transaction on Shopee starts from a product listing. 
In order to get more sales, sellers may engage in certain behaviours to increase their listings' exposure and gain an unfair advantage over other shops. An example of such behaviour is keyword spam, whereby sellers input irrelevant keywords in the listing title that do not accurately describe the products they are selling. For instance, the product title claims that the listing is for pants, shirt, shoes, while the item actually being sold is just a pair of pants. Sellers do this in the hope that when buyers search for shirt or shoes, their listings will also appear in the search results. This behaviour of spamming irrelevant keywords in the title may confuse the search engine, affect the accuracy of search results, and therefore result in a poor user experience. Therefore, it is important to identify, punish and deter such behaviour on Shopee. However, at the same time, we also need to consider the case where sellers input multiple product keywords in the listing title but those keywords are relevant to the products. An example is that the underlying product is a pair of shoes, and the seller describes it in the listing title as ""shoes, sneakers"". In this case, the seller is trying to increase their search exposure, but does not use a misleading product title, and therefore should not be penalized. While it is important to deter negative behaviour, it is also very important to avoid wrongly discouraging positive behaviour. Task: Using the keyword directory, identify the product groups that are present in the product title. Example: Group: 0, Keywords: jacket Group: 1, Keywords: windbreaker, raincoat Index: 0, Name: red jacket windbreaker Since the title in name contains keywords from both groups 0 and 1 --> groups found: [0,1] Input 1.Extra Material 2 - keyword list_with substring.csv: List of product keywords, separated into product groups. Each row is a product group. The same keyword may appear in multiple groups (eg. 
notebook) Some keywords can be substrings of other keywords (eg. keyboard and electric keyboard). In that case, the longer word should take priority over the substring 2.Extra Material 3 - mismatch list.csv: List of mismatch keywords, separated according to keywords. Each column is a mismatch word list. The column header is the product keyword, the remaining cells in the column are the corresponding mismatched words. For each product keyword A (eg. table) in the list, if the corresponding mismatched words B (eg. tennis) are also found in the product name, then keyword A should not count towards the keyword groups found in that product name. 3.Keyword_spam_question.csv: File containing the product names from which you need to extract the product keyword groups. Further Details You will be given a directory of product keywords, organized into keyword groups. The .csv file provided will have 2 columns: Group: arbitrary index of the product keyword grouping Keywords: product keyword. Keywords on the same row denote words that can refer to the same product, and therefore should be considered the same product type (eg. raincoat and windbreaker can refer to the same product) Keywords on different rows denote words that refer to different product types (eg. shirt and raincoat refer to different product types) One keyword may appear in multiple groups (eg. notebook could refer to a computing product or stationery) Some keywords can be substrings of other keywords (eg. keyboard and electric keyboard). In that case, the longer word should take priority over the substring Some keywords change their meaning in the presence of other words in the product name, and therefore should no longer be considered a product keyword if the product name contains certain words (eg. table is no longer a product keyword if the product name also contains tennis). In this case, table should not be counted in a keyword group. 
Note: you do not need to look into the correctness of the grouping, and should use it as-is. Using the keyword directory, you need to identify the product groups that are present in the product title. If 2 product groups are both equally presentable in the result, choose the group with the smaller index. Eg 1: White netbook, ultrabook and gaming mousepad should contain product groups [77, 85], because keyword netbook is in group 77; keyword ultrabook is also in group 77; keyword gaming mousepad is in group 85. Eg 2: Beautiful red notebook shirt jeans should contain product groups [6, 29, 77], because keyword notebook is in groups 77 and 204; keyword shirt is in group 29; keyword jeans is in group 6. Since using group 77 or group 204 will both result in 3 product groups, we will choose group 77 due to the smaller index. Eg 3: Printer toner wallpaper ink should contain product groups [81, 182], because keyword Printer toner is in group 81; keyword wallpaper is in group 182. Even though keyword Printer is in another group (79), it is a substring of Printer toner. Therefore 'Printer toner' takes priority over 'Printer'. Eg 4: Table tennis racket shoes should contain product groups [45, 108], because keyword racket is in group 108 and keyword shoes is in group 45. Even though keyword table is also present in the keyword dictionary, it is not counted because tennis is also present in the product title. Submit Format A csv file (utf-8 encoding) containing 2 columns: index : index of the item in the Keyword_spam_question.csv file groups_found : list of groups that are found in the corresponding item title, sorted ascending. Group names should be according to Extra Material 2 - keyword list_with substring.csv file. If 2 product groups are both equally presentable in the result, choose the group with the smaller index. Example: index groups_found 0 [77] 1 [216, 217] 2 [216, 218, 221] Your submission should have 800000 rows, each with 2 columns. 
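The matching rules above (group lookup, substring priority, mismatch exclusion) can be sketched roughly as follows; the group directory and mismatch list below are toy stand-ins for the competition files, and the tie-breaking between alternative group assignments is simplified to taking the smaller index per keyword:

```python
# Hypothetical sketch of the keyword-group matching rules. The toy
# directory below is illustrative only, not the competition data.
groups = {0: ["jacket"], 1: ["windbreaker", "raincoat"],
          2: ["keyboard"], 3: ["electric keyboard"], 4: ["table"]}
mismatch = {"table": ["tennis"]}

def groups_found(title):
    title = title.lower()
    # Collect every (keyword, group) hit found in the title.
    hits = [(kw, g) for g, kws in sorted(groups.items())
            for kw in kws if kw in title]
    # Mismatch rule: drop a keyword if one of its mismatch words appears.
    hits = [(kw, g) for kw, g in hits
            if not any(m in title for m in mismatch.get(kw, []))]
    # Substring rule: a hit is discarded if a strictly longer hit contains it
    # (e.g. "electric keyboard" shadows "keyboard").
    longest = [(kw, g) for kw, g in hits
               if not any(kw != other and kw in other for other, _ in hits)]
    # Simplified tie-break: one group per keyword, smaller index wins.
    best = {}
    for kw, g in longest:
        best[kw] = min(g, best.get(kw, g))
    return sorted(set(best.values()))

groups_found("red jacket windbreaker")      # -> [0, 1]
groups_found("electric keyboard for sale")  # -> [3]
groups_found("table tennis racket")         # -> []
```

The real tie-breaking rule (choosing among whole alternative assignments of equal size) is subtler than this per-keyword minimum, so treat this only as a starting point.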
Tips: 1) You are advised to run your tests on a sample of the dataset first. 2) If you are unable to solve the entire problem within the time limit, create the output csv with the required number of columns and rows based on a subset of the problem first. Teams which do not make a successful submission for both rounds of the competition will not be considered for the overall ranking.`'",,[Open] I'm the Best Coder! Challenge 2019,inClass,Round 2,categorizationaccuracy,[open]-im-the-best-coder!-challenge-2019 1089,"'`The features are computed from digitized images of breast masses and describe characteristics of the cell nuclei present in the image. Attribute information: 1) ID 2) Diagnosis (M = malignant, B = benign) M = 1 B = 0 3) to 32) Ten real-valued features are computed for each cell nucleus: a) radius (mean of distances from the center to points on the perimeter) b) texture (standard deviation of gray-scale values) c) perimeter d) area e) smoothness (local variation in radius lengths) f) compactness (perimeter ^ 2 / area - 1.0) g) concavity (severity of concave portions of the contour) h) concave points (number of concave portions of the contour) i) symmetry j) fractal dimension (""coastline approximation"" - 1) The mean, standard error and ""worst"" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For example, field 3 is the mean radius, field 13 is the radius SE, field 23 is the worst radius. All feature values are recorded with four significant digits. Missing attribute values: none Class distribution: 357 benign, 212 malignant`'",,Predição de Câncer de Mama,,,auc,predio-de-cncer-de-mama 1090,"'`Predicting wikipedia pagehits This competition is our own little version of the web traffic time series forecasting challenge here: https://www.kaggle.com/c/web-traffic-time-series-forecasting Your task is to make 30-day forecasts for 1000 wikipedia pages. 
The training dataset consists of about a year's worth of traffic data for 100,000 wikipedia pages. The evaluation metric is root mean squared error (not SMAPE as in the original competition) You will notice that this is basically the same task as described in the book, reproducing the book's solution is a perfectly valid thing to do. You are encouraged to experiment and change a few things though.`'",,Oxford Time Series,,,rmse,oxford-time-series 1091,"'` "" 2"" OzonMasters 2020 . - """" ( , ) . : due (UTC). fclass, sclass, t_class . (, "" ""). , . lon, lat , . 2.`'",,ML2_OZONMASERS_2020_C1,inClass,Предсказание отмены заказа в такси,auc,ml2_ozonmasers_2020_c1 1092,"'`About Us One Fourth Labs is an IIT Madras incubated startup with a goal to make India ready for the AI age. We want to skill Indian workforce in the areas of Artificial Intelligence (AI) at almost one-tenth the industry upskilling price. Our flagship online school PadhAI provides India-specific courses on AI, and is open to all students, faculty, and professionals with a basic background in mathematics and python Task The goal is to identify the people's opinion on mobile phones using MP Neuron. The data points are scraped from 91mobiles.com Evaluation Metric Submissions are evaluated on **Accuracy Score** between the predicted and the actual labels on test dataset Acknowledgements Mobile Data Source: https://www.91mobiles.com/ Check this video tutorial to know how to make a submission for this contest.`'",,PadhAI: MP Neuron - Like Unlike Classification,,,categorizationaccuracy,padhai:-mp-neuron-like-unlike-classification 1093,"'`About Us One Fourth Labs is an IIT Madras incubated startup with a goal to make India ready for the AI age. We want to skill Indian workforce in the areas of Artificial Intelligence (AI) at almost one-tenth the industry upskilling price. 
Our flagship online school PadhAI provides India-specific courses on AI, and is open to all students, faculty, and professionals with a basic background in mathematics and python Task The goal is to identify the presence of a character in images using MP Neuron / Perceptron / Perceptron with sigmoid. The character images are compiled in Tamil, Hindi and English. We have altered the task in 4 levels with increase in data complexity. Evaluation Metric Submissions are evaluated on Accuracy Score between the predicted and the actual labels on test dataset Acknowledgements Tamil Character Data: http://www.jfn.ac.lk/index.php/data-sets-printed-tamil-characters-printed-documents/ Hindi Character Data: https://www.kaggle.com/ashokpant/devanagari-character-dataset`'",,PadhAI: Text - Non Text Classification Level 1,,,categorizationaccuracy,padhai:-text-non-text-classification-level-1 1094,"'`About Us One Fourth Labs is an IIT Madras incubated startup with a goal to make India ready for the AI age. We want to skill Indian workforce in the areas of Artificial Intelligence (AI) at almost one-tenth the industry upskilling price. Our flagship online school PadhAI provides India-specific courses on AI, and is open to all students, faculty, and professionals with a basic background in mathematics and python Task The goal is to identify the presence of a character in images using MP Neuron / Perceptron / Perceptron with sigmoid. The character images are compiled in Tamil, Hindi and English. We have altered the task in 4 levels with increase in data complexity. 
Evaluation Metric Submissions are evaluated on Accuracy Score between the predicted and the actual labels on the test dataset Acknowledgements Background Data: http://www.image-net.org/ Tamil Character Data: http://www.jfn.ac.lk/index.php/data-sets-printed-tamil-characters-printed-documents/ Hindi Character Data: https://www.kaggle.com/ashokpant/devanagari-character-dataset`'",,PadhAI: Text - Non Text Classification Level 3,,,categorizationaccuracy,padhai:-text-non-text-classification-level-3 1095,"'`About Us One Fourth Labs is an IIT Madras incubated startup with a goal to make India ready for the AI age. We want to skill the Indian workforce in the areas of Artificial Intelligence (AI) at almost one-tenth the industry upskilling price. Our flagship online school PadhAI provides India-specific courses on AI, and is open to all students, faculty, and professionals with a basic background in mathematics and python Task The goal is to identify the people's opinion on mobile phones using Perceptron. The data points are scraped from 91mobiles.com Evaluation Metric Submissions are evaluated on **Accuracy Score** between the predicted and the actual labels on the test dataset Acknowledgements Mobile Data Source: https://www.91mobiles.com/ Check this video tutorial to know how to make a submission for this contest.`'",,PadhAI: Perceptron - Like Unlike Classification,,,categorizationaccuracy,padhai:-perceptron-like-unlike-classification 1096,'`You know it`',,Parkinson's detection using ML,inClass,Use all the ML magic to detect Parkinson's disease,categorizationaccuracy,parkinsons-detection-using-ml 1097,"'`In this competition you will work with a challenging time-series dataset consisting of monthly sales data for 3 years. These sales are snacks data only, from PepsiCo, for 500 customers (grocery, kiosk, etc.) with their demographic data, located in Istanbul, Turkey. 
We are asking you to predict total snacks sales for every store in the coming months.`'",,PEPSI Projesi,inClass,Data & Analytics Challenge 2nd Online Project 22-28 January,rmse,pepsi-projesi 1098,"'`To create a 3D map of the Universe we need to measure 3 coordinates of galaxies. The celestial coordinates on the sky are easy to get, but measuring their distance is much harder. Since the Universe is expanding, the photons from a faraway galaxy are also expanded during their long journey. The expansion of photons is the redshift: well-known spectral features are shifted to redder colours. According to Hubble's law, the distance is approximately proportional to the redshift. Since galaxies are very faint, it takes a lot of observing time to spread their light in spectrographs and get a high-resolution spectrum. So we have spectroscopic redshifts only for a limited set of them; for the others, redshift may be estimated from broadband photometry, the brightness of galaxies taken through a few colour filters. Estimating the redshift from this limited set is called photometric redshift estimation. A nice animation which says more than a thousand words (open in a new tab and try to refresh if it does not move) Acknowledgements We thank Professor István Csabai for providing the description for the challenge. References: Recent photo-z for SDSS: Beck, Róbert, et al. ""Photometric redshifts for the SDSS Data Release 12."" Monthly Notices of the Royal Astronomical Society 460.2 (2016): 1371-1381. A recent review: Salvato, Mara, Olivier Ilbert, and Ben Hoyle. ""The many flavours of photometric redshifts."" Nature Astronomy (2018). A classic reference: Benitez, Narciso. ""Bayesian photometric redshift estimation."" The Astrophysical Journal 536.2 (2000): 571. More useful references for photo-z in recent surveys (CFHTLenS, KiDS, DES): Hildebrandt, H., et al. 
""CFHTLenS: improving the quality of photometric redshifts with precision photometry."" Monthly Notices of the Royal Astronomical Society 421.3 (2012): 2355-2367. Hildebrandt, H., et al. ""KiDS-450: Cosmological parameter constraints from tomographic weak gravitational lensing."" Monthly Notices of the Royal Astronomical Society 465.2 (2016): 1454-1498. Hoyle, Ben, et al. ""Dark Energy Survey Year 1 Results: redshift distributions of the weak-lensing source galaxies."" Monthly Notices of the Royal Astronomical Society 478.1 (2018): 592-610.`'",,Photometric redshift estimation,,,rmse,photometric-redshift-estimation 1099,"'`Hi! It is boring to wash the dishes. Luckily, half of them are already clean. Train a classifier to identify the clean ones and save time for the new machine learning course ;) It is a few-shot learning competition. We have a dataset of 20 clean and 20 dirty plates in train and hundreds of plates in test. Good luck!`'",,Cleaned vs Dirty V2,,,categorizationaccuracy,cleaned-vs-dirty-v2 1100,"'`[]""Pneumonia Detection by Chest X-Ray Images ( X )"" [] () (Binary Classification Model) [] X () (Binary Classifier) F1-score Test dataset recall precision [NOTE]recall precision Wikipedia Precision and recall Recall = (True Positive) / (True Positive + False Negative) Precision = (True Positive) / (True Positive + False Positive) [] Evaluation [] X X Kaggle Kaggle's Datasets : ""Chest X-Ray Images (Pneumonia) - 5,863 images, 2 categories"". 2018/02/22 Cell ""Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning"" by Daniel S. Kermany, et al., on The Cell Journal ( VOLUME 172, ISSUE 5, P1122-1131.E9, FEBRUARY 22, 2018. ) "" Application of the AI System for Pneumonia Detection Using Chest X-Ray Images ... ... 
We collected and labeled a total of 5,232 chest X-ray images from children, including 3,883 characterized as depicting pneumonia (2,538 bacterial and 1,345 viral) and 1,349 normal, from a total of 5,856 patients to train the AI system. The model was then tested with 234 normal images and 390 pneumonia images (242 bacterial and 148 viral) from 624 patients. ... ... ... 5,856 X 5,232 X 3,883 X (2,538 1,345 ) 1,349 X AI 624 X 234 390 (242 148 )... Figure S6 - Illustrative Examples of Chest X-Rays in Patients with Pneumonia . ( X ) The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia (middle) typically exhibits a focal lobar consolidation, in this case in the right upper lobe (white arrows), whereas viral pneumonia (right) manifests with a more diffuse interstitial pattern in both lungs."" X () (bacterial pneumonia) X () (focal lobar consolidation) () (viral pneumonia) X () "" (interstitial)"" [ REFERENCE ]: Kaggle's Datasets : ""Chest X-Ray Images (Pneumonia) - 5,863 images, 2 categories"", https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia Daniel S. Kermany, et al., ""Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning"", Cell Journal: VOLUME 172, ISSUE 5, P1122-1131.E9, FEBRUARY 22, 2018. https://www.cell.com/cell/fulltext/S0092-8674(18)30154-5 Wikipedia, ""Precision and recall"" https://en.wikipedia.org/wiki/Precision_and_recall , "" (Pneumonia)"", http://www2.cmu.edu.tw/~cmcmd/ctanatomy/clinical/styled-3/acutepancreatitis.html - , "" lobar pneumonia"" http://terms.naer.edu.tw/detail/2774974/`'",,Pneumonia Detection,,,meanfscore,pneumonia-detection 1101,"'`: . ! 2 : WORD-2-VEC, FastText, GLOVE. 9- LSTM TRANSFORMER. 
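The recall, precision and F1 definitions used for the pneumonia detection task above can be written out directly; a minimal sketch with binary labels (1 = pneumonia, 0 = normal) and toy predictions:

```python
# Precision = TP / (TP + FP), Recall = TP / (TP + FN),
# F1 = harmonic mean of precision and recall. Toy labels are illustrative.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
# p = 2/3, r = 2/3, f1 = 2/3
```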
======================== : ['news', 'musical', 'drama', 'romance', 'war', 'biography', 'sci-fi', 'thriller', 'fantasy', 'documentary', 'reality-tv', 'adventure', 'mystery', 'action', 'sport', 'horror', 'comedy', 'short', 'western', 'talk-show', 'adult', 'game-show', 'music', 'history', 'crime', 'family', 'animation']`'",text data,Movie Genre Classification,inClass,Predict movie genre,categorizationaccuracy,movie-genre-classification 1102,"'`About the dataset The Iris dataset was used in R. A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found in the UCI Machine Learning Repository. It includes three iris species with 50 samples each, as well as some properties of each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other. Available columns The columns in this dataset are: Id (unique identifier) SepalLengthCm (sepal length - in centimeters) SepalWidthCm (sepal width - in centimeters) PetalLengthCm (petal length - in centimeters) PetalWidthCm (petal width - in centimeters) Species (species name)`'",tabular data,SERPRO - Iris,inClass,Classify iris plants into three distinct species in this classic dataset,categorizationaccuracy,serpro-iris 1103,"'`About Us One Fourth Labs is an IIT Madras incubated startup with a goal to make India ready for the AI age. We want to skill the Indian workforce in the areas of Artificial Intelligence (AI) at almost one-tenth the industry upskilling price. Our flagship online school PadhAI provides India-specific courses on AI, and is open to all students, faculty, and professionals with a basic background in mathematics and python Task The goal is to identify the presence of a character in images using MP Neuron / Perceptron / Perceptron with sigmoid. The character images are compiled in Tamil, Hindi and English. 
We have altered the task in 4 levels with increase in data complexity. Evaluation Metric Submissions are evaluated on Accuracy Score between the predicted and the actual labels on test dataset Steps to use the predefined models You can use the predefined models namely MPNeuron, Perceptron and PercepetronWithSigmoid in padhai.py by following the steps below: 1) Add the input folder to sys path. import sys sys.path.insert(0, ""../input"") 2) Import the model from padhai.py. from padhai import MPNeuron, Perceptron, PerceptronWithSigmoid 3) Instantiate the model. model = Perceptron() 4) Fit the model using training data. model.fit(X_train, Y_train) 5) Predict the labels for test data. y_pred = model.predict(X_test) Acknowledgements Tamil Character Data: http://www.jfn.ac.lk/index.php/data-sets-printed-tamil-characters-printed-documents/ Hindi Character Data: https://www.kaggle.com/ashokpant/devanagari-character-dataset`'",image data,[T] PadhAI: Text - Non Text Classification Level 1,inClass,Can you predict whether an image has TEXT or NOT?,categorizationaccuracy,[t]-padhai:-text-non-text-classification-level-1 1104,"'`Challenge Description The aim of this challenge is to predict the prices of properties in Washington DC by exploring the various characteristics of properties and their effect on the sales price. The dataset provided consists of 47 explanatory variables describing various aspects of residential homes. This dataset contains information on real property sales for sales between May 1947 to July 2018 for properties located in Washington DC. Practice Skills Creative feature engineering Advanced regression techniques like random forest and gradient boosting in addition to ensemble learning and stacking methods. Acknowledgements All data is available at Open Data DC. The residential and address point data is managed by the Office of the Chief Technology Officer. 
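The padhai module used in the steps above is supplied by the course; as a rough, hypothetical stand-in, a minimal perceptron exposing the same fit/predict interface might look like this:

```python
# Minimal perceptron sketch with a padhai-like fit/predict interface.
# This is an illustrative stand-in, NOT the course's actual implementation.
import numpy as np

class Perceptron:
    def fit(self, X, Y, epochs=10, lr=1.0):
        X, Y = np.asarray(X, dtype=float), np.asarray(Y)
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for _ in range(epochs):
            for x, y in zip(X, Y):
                pred = int(np.dot(self.w, x) + self.b >= 0)
                # Standard perceptron update, applied on misclassification.
                self.w += lr * (y - pred) * x
                self.b += lr * (y - pred)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return (X @ self.w + self.b >= 0).astype(int)

model = Perceptron().fit([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 0, 1])
preds = model.predict([[0, 0], [0, 1], [1, 0], [1, 1]])  # learns AND: [0, 0, 0, 1]
```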
Distribution Liability: data terms and conditions`'",tabular data,Property price prediction challenge,inClass,Challenge yourselves to the property price prediction challenge hosted by GA technologies,rmsle,property-price-prediction-challenge 1105,"'`What is an abalone? Abalone is a genus of marine gastropod molluscs of the family Haliotidae, and the only catalogued genus in this family. It contains several species in coastal waters of almost the entire world. Abalone is highly prized in the cuisine of Asian countries. In addition, its shell is commonly used in jewelry making, especially due to its iridescent sheen. Its dimensions range from two to thirty centimeters. Due to its demand and high economic value, abalone is often harvested on farms and, as such, there is a need to predict its age. How is the age of an abalone calculated? The traditional approach to determining its age is cutting the shell through the cone, staining it and counting the number of rings under a microscope - a tedious and time-consuming task. Some physical measurements, easier to obtain, can be used to predict an abalone's age. Other information, such as weather patterns and location (and therefore food availability), may be needed to solve the problem. About the dataset The dataset in question can be used to obtain a mathematical model to predict the age of an abalone from its physical measurements. 
Available columns The columns in this dataset are: id (Integer): unique identifier of each individual sex (String): gender of the individual, which can be M: male, F: female and I: infant length (Real): Length - longest shell measurement (in mm) diameter (Real): Measurement perpendicular to the length (in mm) height (Real): Height with meat in the shell (in mm) whole_weight (Real): Weight of the whole abalone (in g) shucked_weight (Real): Weight of the meat only (in g) viscera_weight (Real): Weight of the viscera after drying (in g) shell_weight (Real): Weight of the shell after drying (in g) rings (Integer): Number of rings (equivalent to age in years)`'",tabular data,SERPRO - Abalone,inClass,Predicting mollusc age from physical measurements,rmse,serpro-abalone 1106,"'`Assignment This second lab deals with designing sentence embeddings and predicting similarities between paired sentences. For example, the goal will be to predict that the sentence pair: The girl in the blue coverall is painting The woman is holding the paintbrush next to the artist 's easel describes sentences with similar semantic content, whereas the sentence pair: The man is singing heartily and playing the guitar A bicyclist is holding a bike over his head in a group of people is a pair of sentences with no apparent semantic link. 
The dataset used for this lab is derived from SICK-R, which itself belongs to the SentEval suite. Each data line contains a numeric identifier, a sentence A, a sentence B and a similarity to predict, valued in the interval [0,1]. Formally, let \( ( {\bf x}_1, {\bf x}_2 ) \) be a pair of sentences to test; we predict y, a real number in the interval [0,1], which represents the similarity between the sentences. Each sentence embedding is obtained by a bi-LSTM: $$ \begin{align} {\bf h}_1 &= {\it bilstm}(x_1^{(1)} \ldots x_n^{(1)} ) \\ {\bf h}_2 &= {\it bilstm}(x_1^{(2)} \ldots x_m^{(2)} ) \end{align} $$ The similarity prediction is performed by a logistic model: $$ P(Y=1 | {\bf x}_1, {\bf x}_2 ) = \frac{exp\left({\bf w}^T \left[\begin{array}{ll} {\bf h}_1 \\ {\bf h}_2 \\ \end{array}\right] \right) }{1 + exp\left({\bf w}^T \left[\begin{array}{ll} {\bf h}_1 \\ {\bf h}_2 \\ \end{array} \right] \right) } $$ The prediction file to build should contain on each line the numeric identifier followed by the prediction made by your system. The task consists of: Forking the Sick starter kernel Writing a train() method for the model that uses a bi-LSTM Writing a run_test() method for the model that makes predictions on test examples. Using the torch.text library to batch the training data. 
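The logistic similarity model can be sketched in miniature; purely for brevity this stand-in replaces the bi-LSTM encoder with mean-pooled word vectors (an assumption for illustration, not the assignment's method), keeping the logistic score w^T [h1; h2]:

```python
# Toy numpy sketch of the similarity model: encode(s) stands in for the
# bi-LSTM h = bilstm(x_1 ... x_n), and similarity() is the logistic
# model applied to the concatenated embeddings. Vocabulary and weights
# are random toy values, not trained parameters.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in "the girl woman is painting guitar".split()}

def encode(sentence):
    # Stand-in encoder: average the word embeddings of known words.
    vecs = [emb[w] for w in sentence.lower().split() if w in emb]
    return np.mean(vecs, axis=0)

def similarity(s1, s2, w):
    h = np.concatenate([encode(s1), encode(s2)])       # [h1; h2]
    return 1.0 / (1.0 + np.exp(-w @ h))                # sigmoid(w^T [h1; h2])

w = rng.normal(size=8)
score = similarity("the girl is painting", "the woman is painting", w)
# score is a probability-like value in (0, 1)
```

In the actual lab, encode() would be the bi-LSTM trained jointly with w on the SICK-R similarity targets.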
Once this is done, you can enable the GPU in your kernel and observe the speed-up of the training process Submitting your prediction file to the competition The following tutorial can serve as an example to understand how the torch.text library works: https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext Indicative grading scale Code quality (1) Having batched the code and obtained efficient execution with the GPU (1) Having submitted a test result clearly above the baseline (2) Having submitted a test result above the instructor's solution (1)`'",text data,Sentence Relatedness,inClass,Sentence embeddings and paraphrase detection (AA3 - TP2),correlationcoefficient,sentence-relatedness 1107,'` $50k .`',tabular data,2019 SMHRD ( ),,,categorizationaccuracy,2019-smhrd--(--) 1108,"'`Dataset size: 2075259 (it will be reduced for the evaluation on Kaggle) Dataset characteristics: Multivariate, Time-Series Reason for data collection: Predict the average global active energy consumption per minute Considerations: The evaluation metric will be RMSE.`'",tabular data,Competencia-Series-Temporales,inClass,This competition is about predicting a household's energy consumption across the hours.,rmse,competencia-series-temporales 1109,"'`Background At Shopee, we always strive to ensure the correct listing and categorization of products. For example, due to the recent pandemic situation, face masks have become extremely popular with both buyers and sellers; every day we need to categorize and update a huge number of mask items. A robust product detection system will significantly improve listing and categorization efficiency. But in the industrial field the data is always much more complicated, and there exist mis-labelled images, complex background images and low-resolution images, etc. 
The noisy and imbalanced data and the large number of categories make this problem still challenging in the modern computer vision field. Note: This page is for participants from the student group! Task In this competition, a multi-class image classification model needs to be built. There are ~100k images within 42 different categories, including essential medical tools like masks, protective suits and thermometers, home & living products like air-conditioners and fashion products like T-shirts, rings, etc. For data security purposes, the category names will be desensitized. The evaluation metric is top-1 accuracy.`'",image data,[Student] Shopee Code League - Product Detection,,,categorizationaccuracy,[student]-shopee-code-league-product-detection 1110,"'`Statistical forecasting is the art and science of forecasting from data, with or without knowing in advance what equation you should use. The idea is simple: look for statistical patterns in currently available data that you believe will continue into the future. In other words, figure out the way in which the future will look very much like the present, only longer. Time series forecasting techniques are very important for a data scientist to master, as time series data occur in every domain, from medicine to the stock market and climate change prediction. On that note, here is an opportunity to master time series forecasting techniques and, along with that, a chance to get an internship with TerraBlue XT.`'",tabular data,Predice el futuro,inClass,Forecast the values of feature one. ,rmse,predice-el-futuro 1111,"'`The World Department of Commerce launched the Census Bureau to gather data on the country's earnings, employment, and demographics. Its mission is to become the main source of public data about the nation's people and economy. The Census data informs, for instance, on individuals' age, level of education, employment type and income. 
This information could be hugely significant to businesses. For instance, think of businesses which target individuals with high income. These businesses could use machine learning models trained on US Census data to predict someone's income. If this person's income is higher than a given threshold (for instance $50,000 per year), then the company decides to reach out to them. Better detection and segmentation of potential customers reduces marketing costs, increases conversion rate and thus improves return on investment. Overview of How Kaggle's Competitions Work 1. Join the Competition Read the challenge description, accept the Competition Rules and gain access to the competition dataset. 2. Get to Work Download the data, build models on it locally or on Kaggle Kernels (our no-setup, customizable Jupyter Notebooks environment with free GPUs) and generate a prediction file. 3. Make a Submission Upload your prediction as a submission on Kaggle and receive a score based on the metric (here it's F1 Score). 4. Check the Leaderboard See how your model ranks against other Kagglers on our leaderboard. 5. Final Evaluation The top 5 teams on the private leaderboard will be selected based on their metric score. Then out of 10 teams, the team with the best Kernel will win. Visit the Evaluations tab for more info. How to Submit your Prediction to Kaggle Once you're ready to make a submission and get on the leaderboard: 1. Click on the Submit Predictions button 2. Upload a CSV file in the submission file format. You're able to submit 15 submissions a day. Submission File Format: You should submit a csv file with exactly 8561 entries plus a header row. Your submission will show an error if you have extra columns (beyond ID and Income) or rows. The file should have exactly 2 columns: ID (sorted in increasing order) Income (contains your predictions) Got it! I'm ready to get started. Where do I get help if I need it?
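The submission format described above (a header row plus exactly two columns, with ID sorted in increasing order) can be sketched as follows; the `test` frame and the fitted classifier `clf` are hypothetical names used only for illustration, not part of the competition kit:

```python
import pandas as pd

# Hypothetical inputs: `test` holds an ID column plus features, `clf` is any
# fitted classifier exposing .predict(); neither name comes from the kit.
def make_submission(test, clf, path="submission.csv"):
    preds = clf.predict(test.drop(columns=["ID"]))
    sub = pd.DataFrame({"ID": test["ID"], "Income": preds})
    sub = sub.sort_values("ID")    # ID must be in increasing order
    sub.to_csv(path, index=False)  # exactly two columns: ID, Income
    return sub
```

Checking `len(sub)` against the expected 8561 rows before uploading avoids the extra-rows error mentioned above.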
Technical Help: Kaggle Contact Us Page, or else reach out to Sudip. We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn!`'",tabular data,Predict the Income - WITH BOARD,,Predict the income of individuals belonging to various demographics,meanfscore,predict-the-income--with-board 1112,"'`Background PT. GAIB FINTECH NUSANTARA is a Fintech company with an app that helps users conduct digital financial transactions. To impress their investors they want to develop a machine learning based model to predict when their users are going to churn so they can do targeted marketing. Customer churn is a condition where a customer stops using a company's product or service after a certain period of time. For this case, the churn period is set to one month of inactivity (0 transactions in one month). Task As the bright new Data Scientist in the company, you are tasked to help them build this model from scratch. You are given a dataset from the data analyst team containing user activity data. Each row covers the activity done in the one month starting from the date_collected attribute. Each row is given the label isChurned, which indicates whether or not that user churns in the next month. Illustration user_id date_collected total_transaction isChurned 1 2019-01-01 100000 1 2 2019-02-01 75000000 0 Information given: User 1 made transactions totalling Rp 100,000 from 1 January 2019 to 31 January 2019. User 1 churned in the following month (1 February - 28 February). User 2 made transactions totalling Rp 75,000,000 from 1 February 2019 to 28 February 2019. User 2 did NOT churn in the following month. Important : DO NOT SHARE THE COMPETITION LINK OUTSIDE OF GAIB LAB SELECTION PARTICIPANTS`'",tabular data,Seleksi Calon Asisten GAIB,inClass,Seleksi Gaib Tugas III,meanfscore,seleksi-calon-asisten-gaib 1113,"'` - . , , . , . , . - . 
The error is computed as: df['error'] = np.linalg.norm(df[['x', 'y', 'z']].values - df[['x_sim', 'y_sim', 'z_sim']].values, axis=1) The metric is SMAPE.`'",tabular data, ,,,smape,--- 1114,"'`As COVID-19 keeps unleashing its havoc, the world continues to be pushed into the crisis of a great economic recession, and more and more companies are starting to cut their underperforming employees. Companies firing hundreds or thousands of employees is a typical headline today. Laying off employees or reducing an employee's salary is a tough decision to take. It needs to be taken with utmost care, as imprecision in identifying underperforming employees may sabotage both the employees' careers and the company's reputation in the market. Aim of The Competition To predict employee attrition from the given data about each employee's past history. Acknowledgements We thank IBM for providing us with the dataset.`'",tabular data,Summer Analytics 2020 Capstone Project,inClass,,auc,summer-analytics-2020-capstone-project 1115,"'`Our second and optional task is to find the closest time series in terms of Euclidean distance in the dataset for all of the provided 100 queries of both the synthetic dataset and the seismic dataset, i.e., similarity search. Our purpose here is to utilize a summarization method to perform similarity search faster. The observation is that, since the summarizations are shorter than the original time series, calculating Euclidean distances on the summarizations is much faster than calculating Euclidean distances between the original time series. Some time series in the dataset might be skipped without checking their Euclidean distance to the query. This is the case when we say a time series is pruned. The question here is whether utilizing our summarization is safe: Are the series we prune correctly pruned?
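One way to make such pruning safe is to use a summarization whose distance lower-bounds the true Euclidean distance. The sketch below uses Piecewise Aggregate Approximation (PAA) as a stand-in for the project's own summarization method; the function and parameter names are illustrative, not part of the assignment:

```python
import numpy as np

def paa(ts, m):
    """Piecewise Aggregate Approximation: means of m equal segments
    (assumes len(ts) is divisible by m)."""
    return ts.reshape(m, -1).mean(axis=1)

def search_with_pruning(query, dataset, m=8):
    """Return (index of nearest series, pruning ratio).

    Pruning is safe only if the summary distance never exceeds the true
    Euclidean distance; PAA has this property when rescaled by the square
    root of the segment length."""
    seg = len(query) // m
    q_sum = paa(query, m)
    best_d, best_i, pruned = np.inf, -1, 0
    for i, ts in enumerate(dataset):
        # Lower bound on the true distance, computed from the summaries only.
        lb = np.sqrt(seg) * np.linalg.norm(q_sum - paa(ts, m))
        if lb >= best_d:
            pruned += 1          # skip: cannot beat the best exact distance
            continue
        d = np.linalg.norm(query - ts)
        if d < best_d:
            best_d, best_i = d, i
    return best_i, pruned / len(dataset)
```

Because the rescaled PAA distance never exceeds the true distance, a series whose lower bound already meets or beats the best exact distance found so far can be skipped without risk of pruning the true nearest neighbour.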
Notebook Submissions have to be in Python. Go to 'Notebooks', then 'Your Work', then 'Create New Notebook' (or duplicate the template notebook). The submitted notebook should include the summarization/reconstruction functions you submitted for the summarization task of the project. Most importantly, the submitted notebook must include the function similarity that returns the pruning ratio averaged over all 100 queries from both datasets. A solution template can be found here: https://www.kaggle.com/abdumaa/similarity-search-solution-template Submission The submission file is a CSV file containing a header 'id, expected' and one more line containing the value 1 and the pruning ratio returned by the similarity function. Here is the official documentation on submitting a solution from a notebook: https://www.kaggle.com/dansbecker/submitting-from-a-kernel I would rather recommend this YouTube video on how to participate and submit a solution in a Kaggle competition from a notebook (kernel): https://www.youtube.com/watch?v=GJBOMWpLpTQ To follow up on the YouTube video: if you cannot find the ""Commit"" button, click on ""Save Version"", then choose ""Save & Run All (commit)"". Do not forget to share your submission notebook after you have submitted your solution: click on the ""share"" button, type ""abdu maa"" into the ""collaborator"" search box and click ""save""`'",tabular data,Similarity Search Project,inClass,,rmse,similarity-search-project 1116,"'`Nothing ruins the thrill of buying a brand new car more quickly than seeing your new insurance bill. The sting's even more painful when you know you're a good driver. It doesn't seem fair that you have to pay so much if you've been cautious on the road for years. Porto Seguro, one of Brazil's largest auto and homeowner insurance companies, completely agrees. Inaccuracies in car insurance companies' claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones.
In this competition, you're challenged to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year. While Porto Seguro has used machine learning for the past 20 years, they're looking to Kaggle's machine learning community to explore new, more powerful methods. A more accurate prediction will allow them to further tailor their prices, and hopefully make auto insurance coverage more accessible to more drivers.`'",,Porto Seguro's Safe Driver Prediction,,,normalizedgini,porto-seguros-safe-driver-prediction 1117,"'`In this challenge, your task is to calibrate NWP forecasts at 500 surface stations in Germany.`'",,Postprocessing,inClass,Postprocess NWP temperature forecasts. ,mse,postprocessing 1118,"'`The dataset used is the Wine Quality dataset from the UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Wine+Quality Use machine learning to determine which physicochemical properties make a wine 'good'! You need to predict the quality of wine using regression techniques such as Linear and Multiple Linear Regression. Let's see who can do it best! Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10) Evaluation metric: R squared`'",,Predict Red Wine Quality,,,r2score,predict-red-wine-quality 1119,"'`Background Predictive maintenance techniques are designed to help anticipate equipment failures to allow for advance scheduling of corrective maintenance, thereby preventing unexpected equipment downtime, improving service quality for customers, and also reducing the additional cost caused by over-maintenance in preventative maintenance policies.
Many types of equipment (e.g., automated teller machines (ATMs), information technology equipment, medical devices) track run-time status by generating system messages, error events, and log files, which can be used to predict impending failures. Goal The dataset is a time series consisting of the log messages and failure records of 984 days. The problem is to predict in advance which day is a failure day (e.g. 1-day-in-advance prediction) based on features constructed from the log of error messages before the predicted day. Data Feature: the log data of the target machine. There are MANY different error types while the machine is running. Each type of error has a numeric id (for example: 136088194). The feature file is a collection of basic statistics for each error. count: how many times the error occurs on that day. min: tick of the first time the error occurs on that day (seconds); for example, min = 3600 means this error first occurs at 01:00 (3600 seconds into the day). max: tick of the last time the error occurs on that day. mean: mean tick at which the error occurs. std: standard deviation of the tick at which the error occurs. Label: failure record of the target machine. 0: the machine is OK on that day. 1: the machine broke down for some reason on that day. Hint: you are also encouraged to use label and date as features to predict. But make sure you do not use any information from the predicted day onward. Evaluation You may notice that machine failure is a rare event (around 1/10 of the whole period). In this case, we use AUC as the criterion. Make sure your submission is the probability of failure from your model, not a hard prediction (0 or 1).`'",,Predictive Maintenance 1,inClass,Predict machine failure based on log data,auc,predictive-maintenance-1 1120,"'`Introduction Welcome, candidates, to the first stage of the selection process for the data science intern position at Epistemic! We recommend that you read all the instructions carefully.
This test aims to assess your ability to organize a notebook, as well as your ability to understand a dataset and your knowledge of the 3 major areas of data science: Regression, Classification and Clustering. Therefore, the most important thing in this assessment is not necessarily how good your results are in each activity; after all, the performance of many models depends on a hyperparameter optimization that sometimes requires a lot of time. We would rather assess how well you explain each of your steps and the quality of your data exploration. So, organize your notebooks well and take care with the plots used in your explorations - remember to name axes and add titles, units and legends, as well as comments on what was discovered/observed at each step, however trivial it may seem. We know there are countless ways to model the same problem - and countless tools that can be used to do so. Thus, we expect that in each task you test a few different algorithms, with a brief explanation in Markdown of why you are testing those models. At the end of each question, we also expect a short Markdown section commenting on the results obtained in the modelling stage, comparing how each model behaved in terms of generalization (when appropriate), training time, complexity and performance. The solution to the three problems must be in the SAME kernel, separated by Markdown sections. In each question, the goal is to attack different versions of the same dataset about the weather in 5 Chinese cities - identified only by their number, from 0 to 4. The tasks are briefly discussed below. If you have any questions about how to submit a kernel, contact me immediately at the e-mail joaomarcos.marques@epistemic.com.br Task 1 - REGRESSION In this task, your objective is to predict the total MONTHLY VOLUME of rain received by each location for each month and year.
For this task, you will work with the files [regression_targets_train, regression_targets_test, regression_features_train,regression_features_test]. The first two files contain the ground truth for your training (2010 - 2013) and test (2014 and 2015) sets; that is, they map each year, month and city to the total rain volume observed in the period. The other 2 files contain hourly information about the weather conditions observed at each location, for the test and training sets (better explained in the ""data"" tab). Test several models and approaches, explaining with comments in your code the reason for each step and process, as well as any auxiliary variables used in your model. Evaluate your performance using the Root Mean Squared Error (RMSE - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html ) Task 2 - CLASSIFICATION In this task you will still be working with the weather dataset of the 5 cities, but with a different objective. Instead of trying to predict the MONTHLY VOLUME of rain for each city, year and month, you will try to predict WHETHER OR NOT IT RAINED on a given day, month, year and city. In this task you will use the files [classification_targets_train, classification_targets_test, classification_features_train,classification_features_test], which follow a similar scheme to the previous task in terms of naming and organization. As this is a binary classification task, evaluate the performance of your models based on their roc_auc scores (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) Task 3 - CLUSTERING Still working with the weather data of these 5 cities, your objective now is to create a model that can cluster them according to their seasonal behaviors.
In this task, you will work with the data in the ""Clustering Dataset"" file, which contains a series of seasonal characteristics of each city for each of the 4 seasons. You know there are 5 cities and 4 seasons - find ways to group them in an unsupervised fashion and assess the quality of your grouping by visual and automated means - such as, for example, the silhouette (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html) of your clusters. ATTENTION: YOU HAVE 2 DAYS FROM RECEIVING THE LINK TO THIS COMPETITION TO SUBMIT YOUR KERNEL AS PUBLIC WITH YOUR RESULTS AND TO SEND AN E-MAIL ANNOUNCING YOUR SUBMISSION. LATE SUBMISSIONS WILL NOT BE CONSIDERED. GOOD LUCK! Acknowledgements In this competition we used data extracted from: Liang, X., S. Li, S. Zhang, H. Huang, and S. X. Chen (2016), PM2.5 data reliability, consistency, and air quality assessment in five Chinese cities, J. Geophys. Res. Atmos., 121, 10220-10236, [Web Link], via the UCI Machine Learning Repository.`'",,Processo Seletivo Epistemic tm,inClass,Test for the data scientist position - part of the selection process,rmse,processo-seletivo-epistemic-tm 1121,"'`Challenge Description The aim of this challenge is to predict the prices of properties in Washington DC by exploring the various characteristics of properties and their effect on the sales price. The dataset provided consists of 47 explanatory variables describing various aspects of residential homes. This dataset contains information on real property sales between May 1947 and December 2017 for properties located in Washington DC. Practice Skills Creative feature engineering Advanced regression techniques like random forest and gradient boosting, in addition to ensemble learning and stacking methods. Acknowledgements All data is available at Open Data DC. The residential and address point data is managed by the Office of the Chief Technology Officer.
Distribution Liability: data terms and conditions`'",,Property price prediction challenge 2nd,inClass,Challenge yourselves with the property price prediction challenge hosted by GA technologies,rmsle,property-price-prediction-challenge-2nd 1122,"'`ProtonX - InClass Prediction Competition. The task is to predict the percentage chance of getting into university. You may submit at most 10 times per day.`'",,ProtonX - D on kh nng vo i hc,,,mse,protonx-d-on-kh-nng-vo-i-hc 1123,"'`Let's test how much you know about elections. We need to know the average electoral roll per municipality in each autonomous community. To do so: First, you must know the average roll of each autonomous community. Second, you must know the number of municipalities per autonomous community. Third, if you know the community's average roll and its number of municipalities, you will know the roll per municipality in that community. Do you dare?`'",,prueba_master,inClass,test,mae,prueba_master 1124,"'`Top Performing Brands in Shopee There are over 100 unique brands on the Shopee platform. Each brand belongs to a principal, and principals are valuable stakeholders to Shopee. In order to drive campaigns and sales revenue, it is crucial for Shopee to identify how each item from different brands performs under certain performance metrics, such as Gross Merchandise Volume, Gross Orders & Gross Items Sold, etc. As such, it is important for the Data Analysts at Shopee to be able to extract and analyse the data for each item in accordance with the performance metrics for the different brands, so that Shopee can focus on setting up and promoting campaigns for the best performing items. Gross Sales Revenue Definition Gross revenue is the total amount of sales recognized for a reporting period, prior to any deductions. It is calculated by the formula: amount*item_price_usd. Task Write a function that takes a brand as input and answers the following question: The Top 3 itemids (in a list) from the Official Shop of that particular brand that generated the highest Gross Sales Revenue from 10th May to 31st May 2019.
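A pandas sketch of this gross-revenue ranking; the column names (brand, itemid, amount, item_price_usd, order_date) are assumptions about the order table's schema, not the official ones:

```python
import pandas as pd

# Assumed schema: one order row per (brand, itemid) with amount,
# item_price_usd and an order_date string like "2019-05-15".
def top3_by_gross_revenue(df):
    df = df.copy()
    df["gross"] = df["amount"] * df["item_price_usd"]
    # Restrict to the 10th May - 31st May 2019 window from the task.
    window = df[(df["order_date"] >= "2019-05-10") &
                (df["order_date"] <= "2019-05-31")]
    rev = (window.groupby(["brand", "itemid"])["gross"].sum()
                 .reset_index())
    # Highest gross revenue first within each brand, brands alphabetical.
    rev = rev.sort_values(["brand", "gross"], ascending=[True, False])
    return rev.groupby("brand").head(3)
```

Brands whose top list comes back with fewer than 3 rows (or none at all) would then be padded with N.A, as the submission format requires.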
Note: Not all brands will have 3 itemids; in cases where there are none, the output should return N.A Submission Format For each brand (with or without 'Official Shop', total # brands: 270), name the Brand and its associated top 3 items, ex: (3M, 1464762206, 2218766590, 1464762339). The Brands must be sorted alphabetically in ascending order. The output file format is csv and should have 2 columns: an index column and an answer column. Example The csv file you upload is expected to be in the following format. Index Answers 1 3M, 1464762206, 2218766590, 1464762339 2 3M Littmann, 1464762206, 2218766590 3 AHC, N.A In the example answer, the brand names are in the right order but the top 3 items were randomly chosen and have nothing to do with the correct answers. Your submission should have 270 rows (not including the header row), each with 2 columns. Tips: 1) You are advised to run your tests on a sample of the dataset first. 2) If you are unable to solve the entire problem within the time limit, create the output csv with the required number of columns and rows based on a subset of the problem first. Teams which do not make a successful submission for both rounds of the competition will not be considered for the overall ranking.`'",,[Pre-Tertiary] I'm the Best Coder! Challenge 2019,inClass,Round 2,categorizationaccuracy,[pre-tertiary]-im-the-best-coder!-challenge-2019 1125,"'`Welcome to the Kaggle competition part of the Group Project for Statistical Learning and Data Mining (QBUS6810) in Semester 1, 2020.`'",,"QBUS6810 Semester 1, 2020",inClass,The instructions are posted on Canvas,rmse,"qbus6810-semester-1,-2020" 1126,'`Use the given attributes of real estate to evaluate the price of real estate in Taipei.`',,Real Estate Evaluation,,,rmse,real-estate-evaluation 1127,"'`In this competition, your task will be to predict the price of flats in test.csv. You will be given two datasets: train.csv (contains all features and prices of flats) and test.csv (only features).
Invite link for this competition: https://www.kaggle.com/t/9f9b2a84befc470b96c2afb5416d76e2`'",,Real Estate Price Prediction,,,r2score,real-estate-price-prediction 1128,"'` . . , . , . , , -> . ! . . , 3 ( ), , , 4, . RocAuc. . . , """" ( ). . ML .`'",,[SF-DST] Recommendation Challenge,inClass,,auc,[sf-dst]-recommendation-challenge 1129,"'` , -. , - . NDCG@10`'",,SkillFactory Recommendations Challenge,inClass,Recommend movies better!,ndcg@{k},skillfactory-recommendations-challenge 1130,"'`Welcome to the ""competition"" for the final project of the Recommender Systems elective at UNICEN. In the different sections you will find everything you need to participate. Objective You have to predict what rating a user will give to certain books. Some important remarks: You may use any of the techniques we saw in class, or others you come up with! You may program in whatever language you want. You don't need to use Python or the library we used in the notebooks. Creativity! Once the ""competition"" is closed, you must give a presentation explaining the solution you devised. The presentations are planned for the last week of November, so that you can register for the first examination date in December.`'",,Recommender Systems UNICEN 2019,inClass,Welcome to the challenge reserved for the students of the Recommender Systems Course at UNICEN,rmse,recommender-systems-unicen-2019 1131,"'`What makes students drink beer? We have no idea, but in this competition, we are trying to predict it using weather and weekend data. The data set we have is the total beer consumption by volume, in a student area in one of the world's biggest cities, São Paulo, in 2015. You are given all data from January to October, and asked to predict the last two months of the year (those are summer months, by the way).
Data analysis When you are given a data set, one of the first things you most likely want to do is analyze the data a bit. We don't have time to do everything ourselves today. On this page, you can find a very nice analysis of this data. Note that the analysis includes the time period for which we are forecasting (November and December). Of course, in reality, you cannot analyze the future. But let's pretend that the November and December data in the analysis are from the previous year (2014)! Now you are able to see the seasonal pattern for the whole year, although you are only given 10 months of data. You can use that knowledge when selecting what kind of variables your model uses; just don't use the exact numbers for anything, okay? Acknowledgements The whole data set is available here, added by Alexandre George Lustosa. Now, you might be tempted to try and cheat a bit by downloading the right answers. But that would not teach you anything, nor help you win: we are looking for a nice predictive model - not the exact numbers.`'",,Relex Beer Challenge,inClass,Predicting the future is difficult. Predicting the past is useless.,mse,relex-beer-challenge 1133,"'`This Kaggle challenge is your last evaluation in the SMESCM00 class (Computational Statistical Methods, Jean Martinet). The challenge will be held between January 23 (Thursday) 13.30 and February 14 (Friday) 20.00. Time is in UTC. NEW: teams of 2 students max allowed (not mandatory). Context The objective is to classify data generated by a Spiking Neural Network (SNN). The general approach is described in this paper. Warning This challenge contains real, unexplored, freshly generated experimental data from the on-going research of our lab. Students may or may not find useful results during the challenge. Conversely, the data may or may not be too easy to process. This makes the challenge realistic, exciting, and challenging.
This also makes your work particularly important, since your results might be useful for a research group. The main point is to find relevant answers, by any means. You are free to use any technique learnt in the class or elsewhere. Experimental settings and data description A webcam is plugged into an event-camera simulator, and shown several stimuli, in the form of object motion in four directions (up, down, left, right). The event-camera simulator is a piece of software written to simulate an event-camera. The event data is fed to a one-layer SNN with 10 output neurons, and the output activity is recorded for a fixed short duration. One sample output vector is made of the output neuron spike count (onsc) during this duration, i.e. 10 integer values. There are two datasets: BEFORE_TRAINING (BT) obtained on a randomly initialised SNN, and AFTER_TRAINING (AT) obtained with the same network after a short STDP-based unsupervised training. The main dataset is AT. The datasets are obtained by presenting an object (a hand) moving in translation before the webcam with a fixed speed, in one of the four directions. This has been repeated 20 times for each class. The original 80 samples have been rotated by 90, 180, and 270 degrees for augmentation. Moreover, this data augmentation removes any bias related to the video acquisition, since each sequence now belongs in all four classes. The resulting 320 event sequences are fed to the network, and we obtain 320 10-D vectors, which are split into train / test sets. Classes are balanced in all sets. Class numbers are 0 for up, 1 for down, 2 for left, and 3 for right. BTtrain.csv and ATtrain.csv (64 rows, 11 columns) label, onsc1, onsc2, onsc3, onsc4, onsc5, onsc6, onsc7, onsc8, onsc9, onsc10 3 1 1 0 3 0 2 2 4 0 2 1 0 1 0 5 0 2 2 5 0 2 3 0 2 0 4 0 2 2 4 0 1 0 0 1 2 5 0 5 1 3 0 3 2 0 2 2 5 0 6 2 4 0 2 etc.
BTtest.csv and ATtest.csv (16 rows, 10 columns) onsc1, onsc2, onsc3, onsc4, onsc5, onsc6, onsc7, onsc8, onsc9, onsc10 0 2 2 5 0 6 2 4 0 2 0 2 2 7 0 7 1 1 0 5 0 3 2 6 0 5 2 1 0 4 etc. sampleSubmission.csv (16 rows, 2 columns) id, label 0, 0 1, 0 2, 0 3, 0 4, 0 etc. Expected work and results Three parts are expected: A statistical analysis to determine whether the data in BTtrain.csv significantly differ from ATtrain.csv. Of course, the hypothesis under test is that the data differ. The main task in this challenge is the training of a model based on ATtrain.csv to correctly classify ATtest.csv. You are expected to work in a Kaggle notebook and to post your results frequently (2 submissions per day allowed). The last task (not to be posted) consists of training the same model on BTtrain.csv to correctly classify BTtest.csv. Do not try too hard to get good results with this dataset, since we expect (and even hope) this result to be bad. Grade The grade will take into account (subject to modifications): (4/10 points) a synthetic description of your solution, including the statistical analysis and a detailed explanation of your choices, and why you believe they are good (4-pages max pdf) -- PDF not ipynb. The report is to be submitted on LMS (4/10 points) the last submitted solution for AT_test (2/10 points) the final challenge ranking (max for first) Note that the ranking formula is coef x [ 1 - (rank-1) / (nbTeams-1) ]`'",,SMESCM00 Final evaluation challenge,inClass,Classify SNN output data by all possible means,categorizationaccuracy,smescm00-final-evaluation-challenge 1134,"'`Hi, the data is up, you can start testing. Questions on Monday. P.`'",,barx-1,inClass,barx-1,mse,barx-1 1135,"'`Train models for sentiment classification. Each training sample consists of a label and a text field. Label mapping: 0 for negative, 1 for neutral and 2 for positive. You can use Google Colab as the platform for model training and prediction if you want free GPU resources.
Here are the instructions for using Colab. Acknowledgement: The dataset is extracted from the Yelp Kaggle challenge. By using this dataset, you agree to the Yelp Dataset Terms of Use. Notes: External datasets are not allowed. Any prediction generated manually or by extracting the labels from the Yelp dataset will not be evaluated. You must train your models on the given dataset to make predictions.`'",,Review Sentiment Analysis,inClass,Train models to predict the sentiment polarity of reviews,categorizationaccuracy,review-sentiment-analysis 1136,"'`As part of the Kainos AI Camp, we are running our own (small) Kaggle competition. Unlike regular Kaggle competitions, only other Campers are taking part. The aim is to achieve the highest accuracy. How to form a team You can have a team of up to three people. First you will have to accept the competition rules. To form a team, click on My Team from the dashboard. There you can search for and add other Kagglers to your team. Please note: until your teammates have accepted the competition rules, they will not show up here. Be sure to designate a Team Leader. If you and someone you want to work with have both formed teams, it is possible to merge teams (as long as the total number of people is no more than 3). The aim of the competition You will be given a dataset containing Amazon reviews, together with a label saying whether they are positive or not. Your job is to create a binary classifier that can detect whether new reviews are positive or negative (this is the same as the Sentiment Analysis Twitter task you will have done earlier this week). Unlike the last challenge, this dataset is massive - it's a significant fraction of all Amazon reviews from the last 18 years. The scoring will happen on a part of the dataset that you are not given to train on, so make sure your validation accuracy is high! If you overfit on the training data, your submission will have a low score!
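As one plausible starting point (not the camp's prescribed approach), a TF-IDF plus logistic-regression baseline with a held-out split makes the overfitting risk mentioned above visible; all names here are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_baseline(texts, labels):
    # Hold out 20% so the validation accuracy approximates the hidden test score.
    X_tr, X_val, y_tr, y_val = train_test_split(
        texts, labels, test_size=0.2, random_state=0)
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    return model, model.score(X_val, y_val)  # validation accuracy
```

If the training accuracy sits far above this validation accuracy, the model is overfitting and the leaderboard score will suffer.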
The final score is the accuracy (technically the categorical accuracy) that you achieve on the test dataset. How to submit You can make multiple submissions - although you are limited to 10 submissions per day. The top two submissions (as in, the two best submissions) from each team will be scored at the end. Kaggle will automatically select the top two, although you can overrule this if you choose to. Advice There are many different classifiers suitable for doing binary classification. To win this competition, you will probably need to try several different ones and find the most suitable. This is where being in a team is very useful. Prizes Every member of the winning team will get a mystery prize, which is being awarded at the end of the Hackathon on Saturday. We'll announce the winners on the day of the submission deadline, though.`'",,Sentiment Analysis (Belfast),inClass,Kainos AI Camp 2019 - Text classification challenge on Amazon Reviews,categorizationaccuracy,sentiment-analysis-(belfast) 1137,"'`Tweet Polarity Classification: Given a message, classify whether the message is of positive, negative, or neutral sentiment. We use average F1 score over positive, negative and neutral tweets as the scoring metric. Each test line should contain a pair of tweet_id and its predicted sentiment. 264238274963451904,positive`'",,Sentiment Analysis of tweets,inClass,"Classify tweet sentiment - Positive, negative or neutral",meanfscore,sentiment-analysis-of-tweets 1138,"'`: , , . - ( , , ). , . , ( ). . ? , . , . ( ), . : . . . """" ( ). . ( ) ML . ( DL) - . ML , . : , Baseline, !`'",,[SF-DST] Car Price prediction,,,mape,[sf-dst]-car-price-prediction 1139,"'` , . , , RandomForestRegression, TripAdvisor. , , . : , . . . Notebooks (Baseline) !`'",,[SF-DST] Restaurant Rating prediction.,,,mae,[sf-dst]-restaurant-rating-prediction. 1140,"'`Background At Shopee, we always strive to ensure the correct listing and categorization of products.
For example, due to the recent pandemic, face masks have become extremely popular with both buyers and sellers, and every day we need to categorize and update a huge number of mask items. A robust product detection system will significantly improve listing and categorization efficiency. But in an industrial setting the data is always much more complicated: there exist mis-labelled images, complex-background images, low-resolution images, etc. The noisy and imbalanced data and the multiple categories keep this problem challenging even in the modern computer vision field. Note: This page is for participants from the open group! Task In this competition, a multi-class image classification model needs to be built. There are ~100k images within 42 different categories, including essential medical tools like masks, protective suits and thermometers, home & living products like air-conditioners, and fashion products like T-shirts, rings, etc. For data security purposes, the category names are anonymized. The evaluation metric is top-1 accuracy.`'",,[Open] Shopee Code League - Product Detection,,,categorizationaccuracy,[open]-shopee-code-league-product-detection 1141,"'`Submission Opens: 11 July 2020, 12pm (GMT+8) onwards. There will be no submission from 4 July to 10 July 2020. Note: For this challenge, submission will NOT be done on Kaggle. More details about the submission will be revealed over email by 11 July 2020. Background At Shopee, we have customers and sellers across Southeast Asia and Taiwan. In order to provide a better shopping experience, we set up the cross-border business to improve the variety of products for our customers. In order to help local sellers migrate their SKUs to different foreign markets, we hire a group of professional human translators with an e-commerce background to take care of the product translation. The product information translated includes product title, variation and description.
However, as Shopee has grown tremendously in recent years, the amount of translation per day is beyond the human translators' capacity. At the same time, with the development of AI technology, machine translation is now deployed in many industrial areas to assist human translators, and it can achieve near-human translation quality. At Shopee, we have an in-house machine translation pipeline which can translate millions of SKUs per week across different languages. The languages include Traditional Chinese, Bahasa, English, Vietnamese, Thai and Portuguese. The main challenge in machine translation is usually the lack of labelled data. However, various approaches to unsupervised machine translation have been explored and proven effective with little or even no labelled data. Some techniques are cross-lingual word alignments and pretrained cross-lingual language models. **Note: This page is for participants from the open group!** Task Given a product title in Traditional Chinese, the candidate is expected to translate the title into English. Dataset: Candidates are provided with two monolingual product title datasets (in Traditional Chinese and English). Use of public data is encouraged. Metrics: The BLEU score over the whole test set is used to assess the translation quality.`'",,[Open] Shopee Code League - Product Translation,,,categorizationaccuracy,[open]-shopee-code-league-product-translation 1142,"'`In Eubacteria, the sigma factor (σ) binds with the RNA polymerase core enzyme to create the RNA polymerase holoenzyme. Being part of the holoenzyme, the sigma element acts as the connection between RNA polymerase and DNA. Thus, the affinity between the RNA polymerase and DNA is mainly attributed to the bound sigma element. The goal of the competition is to predict the binding of an RNA polymerase with a given sigma factor to a DNA sequence. To the students: a reminder that the final ranking in this competition does not map directly onto the exam scoring (regarding this competition).
It is perfectly possible for lower scores to outweigh higher scores if a good strategy and insights into the methods used can be demonstrated at the exam. All (cleaned) notebooks are to be handed in by the end of the competition. No notebooks (-> Kernels) are to be made public for this competition.`'",,Sigma site prediction,inClass,Predict interaction between DNA and sigma-factors in E. coli,mcauc,sigma-site-prediction 1143,"'`This is the example for the Churn Prediction class. We thank Professor Plum, Ph.D. for providing this dataset.`'",,Churn prediction,,,rmse,churn-prediction 1144,"'`Welcome to the final hackathon! https://www.youtube.com/watch?v=vIci3C4JkL0&t=38s The target metric is AUC-ROC; a score of 50% corresponds to random guessing. Can you beat Jian Yuang? Good luck!`'",,SkillFactory | Final hackathon,inClass,The final hackathon with a cult-classic task!,auc,skillfactory-|-final-hackathon 1145,"'`A binary classification task to check your overall data science skills. The target metric is ROC-AUC. Useful resources: How to Win a Kaggle Competition, Kaggle Tips and Tricks, Winning Tips on Kaggle by Kazanova, Kaggle Tutorial Kernels, Top 20 Data Science websources.`'",,Skill task,inClass,Binary classification task to check overall data science skills,auc,skill-task 1146,"'`Hello There! Welcome to the Denver-Seattle Data Contest! Problem Overview: Emergency responders and news outlets monitor Twitter for signs of natural or human-driven disasters. The 2014 earthquakes in Silicon Valley were detected in 29 seconds using Twitter data! How the USGS uses Twitter to track earthquakes Teams will have access to data (see the Data tab for more info) containing tweets from individuals corresponding to a few different types of disasters, mixed with other tweets from the same time period. Our quest is to create a model which predicts whether or not a tweet seems to pertain to a disaster, and then create a visualization/presentation describing findings, insights and your team's approach. The visualization/presentation will be evaluated at Demo/Celebration time.
This problem is very similar to, and was inspired by, an existing Kaggle competition: Real or Not? NLP with Disaster Tweets. Teams may find some of the resources posted to that competition useful for learning about machine learning and natural language processing! Please note, the data used for this contest is different from the data used for Real or Not? NLP with Disaster Tweets. Please use whatever resources you find the most helpful! Acknowledgement for the data will be provided at the end of the contest, so as to prevent ambitious web-scrapers from competing in a way that misses the spirit of the contest :)`'",,Slalom _Build Data Contest,inClass,Denver-Seattle Data Contest,meanfscore,slalom-_build-data-contest 1147,"'`Introduction Employee turnover numbers have always been important for any organization, especially during continued growth: the more the organization grows, the more critical these numbers become. It is therefore more important now than ever to keep the number of voluntary dismissals very low, as this leads to huge savings in the cost per hire, including advertising, recruiting, training, internship expenses, and the salary variance between hired and dismissed employees. Obviously, the most cost-efficient and effective approach overall is to manage unwanted dismissals by preventing them in a timely manner, before a person makes the final decision to leave the company. We invite you to participate in this competition for the best solution in the area of HR Analytics and to develop a predictive model that assesses the Employees' Dismissal Risk based on historical data analysis.
Problem Description The major goal of the competition is to develop a model which answers the question: ""Will a particular Employee leave the Company within the next three months?"" The solution should be made with respect to the following business goals: Managers should be notified about Employees even in case of an insignificant risk of Dismissal. Managers should get information regarding Dismissal drivers - interpretability of the prediction result is desirable. Managers should get recommendations regarding potential actions for Employee retention. Point 1 is mandatory for this competition; points 2-3 are optional and will be assessed independently. Expected results With respect to the business needs, the following outcomes are expected: Assignment of a binary label to each employee (0 - Not At Risk, 1 - At Risk) for the three months after the observed period. Analysis of the historical data and modelling results in order to define the factors that have the most significant impact on Employee Turnover; reasoning and conclusions should be presented in the form of a report. A description of a sensitivity (what-if) analysis approach with respect to the data insights and modelling results. Results Assessment Task 1 will be assessed formally by calculating a success score, which reflects the accuracy of the solution while taking business criteria into account. Tasks 2-3 will be assessed for the 15 teams that show the best results on the main task, and are considered a separate nomination. Tricky Moment Please pay attention: we do not provide the target variable, but we have already generated it for validating your results. So it is critically important that your submission assigns the correct binary flag to each employee in the dataset with respect to their historical data.
For instance, if an employee left the company in Oct 2017, the target flags should be (Employee ID, Date, target): 1, 2017/04, 0; 1, 2017/05, 0; 1, 2017/06, 0; 1, 2017/07, 1; 1, 2017/08, 1; 1, 2017/09, 1. The record for the last month (2017/10) is absent, because an employee who leaves the company in a given month has no target for the next period.`'",,SoftServe DS Hackathon 2020,,,f_{beta},softserve-ds-hackathon-2020 1148,"'`Rain or Not? For this assignment you will work on training a classifier to predict whether or not it is going to rain tomorrow in Seattle. You are provided with a dataset of historical precipitation recordings dating back to 1948. You will test your predictions against data from the 21st century (read the Data tab for more details on the data). You will get more practice doing machine learning in the ""real world"" in this week's assignment. For many ML tasks, you will be given a dataset and your goal is to find the model that will do best in the future by using that dataset to train the model. Instead of risking deploying a bad model into production, a great test to verify that the model will work in the future is to evaluate its accuracy on data it has never seen before. For this assignment, we will do just that! You will have access to a training set with labels and a test set without labels. Your task is to train a model that will do optimally in the future, which will be assessed by looking at your predictions on this test dataset you haven't seen the labels for. Kaggle is a good medium for simulating what it is like to deploy models in the real world (or at least one step closer than doing homework)! It assesses your model on these unseen examples and gives you an indication of how you will do by reporting your accuracy on a portion of the test set. This is shown as the ""leaderboard"" when you submit.
The leaderboard is only meant to give you an indication of how your model might do in the future; the final evaluation will be on data that has never been seen before or used to calculate leaderboard positions. This means you should take the leaderboard scores with a grain of salt, since the final evaluation will be on a different dataset. The idea of being assessed on data you have never looked at before might seem a bit unnerving at first, but this is exactly the scenario that happens whenever an ML practitioner deploys a new model. Uncertainty is unavoidable in the world, so we need to use the ML best practices we have learned in the class so far to assure ourselves that our model will do well in the future. As a note, this assignment is NOT graded as a competition. The Kaggle leaderboard is just supposed to be a fun way for you to compare your model against your colleagues', and the rankings are not used in grading. All that matters is how your model does in the end, not how it competes with everyone else's (more on grading on the Evaluation page). Assignment The programming portion of this assignment involves writing code to train a model that accurately predicts whether it is going to rain tomorrow in Seattle based on observations from the previous days. There is no starter code for this assignment, so you will need to write everything based on what we have done in previous assignments. The programming quiz for this assignment consists of several questions to guide your process. You also need to turn in your notebook code on Gradescope (it could be several notebooks): make sure the notebooks explain the process of what you are doing (use markdown cells), together with the prediction submission file. Teams should use the group submission option. Like all other assignments, there is still a set of concept questions you should complete for this week on Gradescope.
Here are several components we expect you to complete: Run at least 3 different types of classifiers (different methods; you can additionally run different versions of individual methods with different parameters and features) for prediction. Visualize the ROC curves of at least 3 different classifiers on the same plot. Briefly discuss what you infer from this plot. Briefly discuss what features you used and whether you applied some transformations to them, and why. What features seemed important for your final model? Explain how you set up your validation. Explain what you did to improve your initial predictions. Did it help? Here is a helper notebook showing how to plot several curves on the same graph. Team Formation For this homework, you are allowed to work in a team of up to 3 people. Each team will submit a joint assignment. To facilitate the process, you can sign up in the spreadsheet in the Canvas announcement (I advise you not to spend time on selecting your teammates but simply sign up in order: it is an opportunity to meet a new classmate, and in real life we often do not get to select our collaborators, so learning how to navigate that process is a good skill). I leave it up to you to decide how to communicate: it could be email, Zoom meetings, Canvas, or shared Google Docs and notebooks that you comment on. It may turn out that one of your classmates is in a different time zone: be considerate and find a way to share progress regardless. The deadline to sign up on the sheet is Monday 5:00 pm PDT. Sign up early so that you can start on the assignment. If you work with a team you have to contribute to the work and submit a joint submission. Each team will be able to submit up to 6 submissions per day. Remember that you can do your evaluation locally, i.e. you do not have to rely on the leaderboard to achieve good performance. We will use your final submission for the evaluation. Once your team is formed you can form your team on Kaggle.
When you make a submission on Kaggle, by default it will use your username on the leaderboard. Change your ""team name"" in the Team tab, which will change what shows up on the leaderboard. For simplicity, use {# in spreadsheet}-TeamName: for example, 1-first-team. Even if you are submitting as an individual user, you can use this option so your name does not show on the leaderboard (remember the competition website is public). User/team names must not be offensive and should be school appropriate. If we are not able to find which submission on Kaggle is yours, you will not receive credit for what you submit on Kaggle. Due Date Follow the Gradescope due dates. The Kaggle competition ends two days later to allow for late submissions.`'",,CSE/STAT 416 - Homework 5,,,categorizationaccuracy,cse/stat-416-homework-5 1149,"'`Important Update! Hi everyone, there are some issues with the Kaggle dataset server: it seems that the dataset is not shown in its latest version, and it is not possible to update it. For this reason, I set up a new challenge page that you can find here: https://www.kaggle.com/t/f3dfa748f2da48ee816786ced50e11de I also updated the link in the GoogleDoc document so other students will no longer use this page with the old dataset version. The official challenge is no longer on this page, but on the link I posted above. Guglielmo`'",,This challenge moved to another web page,,,categorizationaccuracy,this-challenge-moved-to-another-web-page 1150,"'` Let's get loud! If you ever tried to play around with vinyl records and classic heavy-duty turntables, or even with lighter, shinier and more contemporary DJ gear, the bare minimum you had to learn was how to sync the beat of two songs and come up with a smooth, silky transition between them: master that and you're halfway to success, as simple as that! The musical aspects of tempo, beat, and rhythm play a fundamental role in the understanding of, and the interaction with, music.
It is the beat, the steady pulse, that drives music forward and provides the temporal framework of a piece of music. Intuitively, the beat can be described as a sequence of perceived pulses that are regularly spaced in time and correspond to the pulse a human taps along to when listening to the music. The term tempo then refers to the rate of the pulse. Musical pulses typically go along with note onsets or percussive events. Locating such events within a given signal constitutes a fundamental task, which is often referred to as onset detection. Tempo Estimation While many different tempo estimation techniques have been proposed, recent comparative studies suggest that there has been relatively little improvement in the state of the art. Current approaches to tempo estimation focus on the analysis of mainstream popular music with a clear and stable rhythm and percussion instruments, which facilitates the task. These approaches mainly consider the periodicity of intensity descriptors to locate the beats, and then estimate the tempo. Nevertheless, they usually fail when analyzing other music genres such as classical music, because this type of music often exhibits tempo variations. Your Goal Easy: contrary to many existing systems, which typically first identify beats and then derive a tempo, here you will try to estimate the tempo directly and blindly (read ""ignorantly"") from generic spectral signatures extracted from 8-second-long song snippets (see the Data section for details).`'",,"Statistical Learning (Sapienza, Spring 2020)",inClass,It's the final Hacka!,rmse,"statistical-learning-(sapienza,-spring-2020)" 1151,"'`9 features (satisfactionlevel, lastevaluation, numberproject, averagemontlyhours, timespendcompany, Workaccident, promotionlast5years, Department, salary). https://www.kaggle.com/pankeshpatel/hrcommasep`'",,SejongAI..[],,,categorizationaccuracy,sejongai..[] 1152,"'`Submission Opens: 11 July 2020, 12pm (GMT+8) onwards.
There will be no submission from 4 July to 10 July 2020. Note: For this challenge, submission will NOT be done on Kaggle. More details about the submission will be revealed over email by 11 July 2020. Background At Shopee, we have customers and sellers across Southeast Asia and Taiwan. In order to provide a better shopping experience, we set up the cross-border business to improve the variety of products for our customers. In order to help local sellers migrate their SKUs to different foreign markets, we hire a group of professional human translators with an e-commerce background to take care of the product translation. The product information translated includes product title, variation and description. However, as Shopee has grown tremendously in recent years, the amount of translation per day is beyond the human translators' capacity. At the same time, with the development of AI technology, machine translation is now deployed in many industrial areas to assist human translators, and it can achieve near-human translation quality. At Shopee, we have an in-house machine translation pipeline which can translate millions of SKUs per week across different languages. The languages include Traditional Chinese, Bahasa, English, Vietnamese, Thai and Portuguese. The main challenge in machine translation is usually the lack of labelled data. However, various approaches to unsupervised machine translation have been explored and proven effective with little or even no labelled data. Some techniques are cross-lingual word alignments and pretrained cross-lingual language models. **Note: This page is for participants from the student group!** Task Given a product title in Traditional Chinese, the candidate is expected to translate the title into English. Dataset: Candidates are provided with two monolingual product title datasets (in Traditional Chinese and English). Use of public data is encouraged.
Metrics: The BLEU score over the whole test set is used to assess the translation quality.`'",,[Student] Shopee Code League - Product Translation,,,categorizationaccuracy,[student]-shopee-code-league-product-translation 1153,"'`Potato Batata Suelen?`'",,Suelenator: Final Project,,,wmae,suelenator:-final-project 1154,'``',,Summer School 2020,inClass,competition,meanfscorebeta,summer-school-2020 1155,"'`Owing to its experimental simplicity, thin-film formation via the deposition and self-assembly (SA) technique has been widely studied with different types of amphiphilic biomolecules, which tend to be adsorbed in different ways onto different materials and to form special patterns through self-assembly. The basis of the self-assembled formation of molecules on a surface is the attraction between the functional interface and the molecules, and between molecules interacting in an aqueous medium. The topography is reconstructed by scanning a tip across the surface (X1, X2) and measuring the height (Y) through the deflection of the tip. This process is shown in Figure 1.`'",,Protein deposition,inClass,Reconstructing the topography using an Atomic Force Microscope,r2score,protein-deposition 1156,"'`ACM Machine Learning Contest - 1 This contest is a beginner-level machine learning competition. You are free to use any language/library/resources you want. It is specifically aimed at machine learning enthusiasts at SVNIT. I am a beginner. What do I do? There is a page that reads Getting Started. It has basic tutorials on how to get started with Kaggle. Here is a list of tutorials you would find useful for solving this problem.
Numpy Documentation Pandas Documentation Scikit-Learn Documentation Tutorial on Random Forest Classifier Acknowledgements @vaibhavgeek @akshaybusa`'",,ACM Machine Learning (SVNIT),inClass,This is the basic competition on random forest classifier Level: Beginner Level,auc,acm-machine-learning-(svnit) 1157,'`This is the home page of the SYDE 522 data challenge (Assignment 3) for Winter 2020. The task is 8-class image classification. You can make up to 3 submissions per day.`',,SYDE 522 (Winter 2020),inClass,Data Challenge for SYDE 522 (Assignment 3),categorizationaccuracy,syde-522-(winter-2020) 1158,"'`TAMU DGCI Estimation This activity is based on the TAMU DGCI dataset, which aggregates turf grass images. The aesthetic quality of turf is often evaluated in field studies. In some contexts, digital image analysis provides a quantitative means to inform researchers. The dark green color index (DGCI) is a criterion aimed at direct comparison with visual ratings. The goal is to use data science and machine learning to estimate the missing DGCI indices.`'",,TAMU DGCI,,,rmse,tamu-dgci 1159,"'`Hello and welcome to our first Kaggle class exercise. Here we will practice everything we have learned so far. In this exercise we will practice algorithms for predicting regression outcomes. You can use R or Python to calculate your predictions. Acknowledgements This dataset was obtained from the UCI Machine Learning Repository. The dataset was kindly donated by Hadi Fanaee-T from the Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto, Portugal`'",,TCDS - Machine Learning #1,inClass,Our first class Kaggle challenge with machine learning,mae,tcds-machine-learning-#1 1160,"'`In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from cartographic variables. The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data.
Independent variables were then derived from data obtained from the US Geological Survey and the USFS. The data is in raw form and contains binary columns for qualitative independent variables such as wilderness areas and soil types. The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that the existing forest cover types are more a result of ecological processes than of forest management practices. If you have any questions, please contact us @ tdac.tech@gmail.com`'",,TDAC Forest Cover Classification,inClass,Correctly predict the type of forest cover. This is a multiclass classification challenge. ,categorizationaccuracy,tdac-forest-cover-classification