Lung Cancer Detection using Machine Learning

Lung disease is one of the most common diseases, and it must be detected at an early stage to improve the rate of patient survival. Diagnosing cancer is among the most challenging tasks for the radiologist, so an intelligent computer-aided system is very helpful. Various studies have addressed the detection of lung cancer with machine learning (ML) techniques, and multi-stage classification is the approach most often used to predict it. In such systems, image enhancement and segmentation are performed before classification: the segmentation method uses thresholding and the marker-controlled watershed, and a binary classifier is used for classification, which gives lung cancer detection a higher degree of accuracy. In this work the dataset is used to train several algorithms: Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Decision Tree, Logistic Regression, Naïve Bayes and Random Forest. The best performance, an accuracy of 88.5%, was obtained with the Random Forest algorithm.


I. INTRODUCTION
Lung disease is associated with several risk factors such as age, sex, diabetes, high blood pressure, cholesterol, abnormal pulse rate and other factors. Many techniques from data mining and neural networks have been applied to lung disease. The classification algorithms used for this disease include K-Nearest Neighbor (KNN), Decision Trees (DT), Naive Bayes (NB), Random Forest, Logistic Regression and Support Vector Machine (SVM). The nature of lung disease is very complex, and hence the disease must be handled in a proper way. Knowledge from medical science and data mining is used for discovering various metabolic syndromes. Data mining with classification and clustering plays an important role in the prediction of lung disease and in data investigation. Many conditions damage the lungs and can even cause premature death. People who smoke have the highest risk of getting lung cancer. Lung cancer is divided into two types, namely non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC). It also kills more people than breast, colon, prostate or ovarian cancer, and the risk of lung cancer rises as people reach the age of 60. Lung cancer is the growth of abnormal cells in the lungs; these cells divide rapidly and form tumors. Several attributes are used here to analyse lung cancer, such as Thalassemia, Constrictive pericarditis, the person's resting blood pressure (trestbps), fasting blood sugar (FBS), the electrical activity of the heart at rest (Restecg), maximum heart rate achieved (Thalach) and Coronary Calcium scan (CA). These are the attributes of the dataset on which the models are trained and tested, where 75% of the data is used for training and the remaining 25% is used for testing.

LUNG CANCER INCIDENCE PREDICTION USING MACHINE LEARNING ALGORITHMS.
Cancer is a malignant tumor caused by the irregular division of cells in a tissue or organ, and many types of cancer occur in both males and females. Incidence prediction is based on statistical analysis of the data and on neural-network-based models. The backpropagation algorithm is used in the multi-layer perceptron to update the weights of the neurons with the gradient descent algorithm. Generally, the initial weights are assigned randomly; the input is fed to the network, and the total potential of each neuron in the hidden layer is calculated from its inputs and their corresponding weights. The output of each neuron is produced by its activation function, and the calculations are repeated up to the output layer. At that layer the output is compared with the target and the error is computed. The method is used in real-life applications, optimization problems and prediction. The Long Short-Term Memory (LSTM) network is an effective kind of recurrent network and is used for classification and prediction analysis. Its major components are the cell, the input gate, the output gate and the forget gate. The forget gate is used to remove irrelevant data, and the input gate accepts new data into the cell. The output of the LSTM uses the sigmoid activation function. Through its weights the network remembers previous errors and minimizes the network error. Support Vector Regression (SVR) is a kind of support vector machine that accepts real values instead of binary ones and is used effectively for prediction problems. It builds a subset of the training data known as support vectors and minimizes the distance between the observed and predicted data to improve its performance.
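
The forward pass and weight update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the surveyed work: the 2-2-1 network size, the learning rate, the squared-error loss, the toy OR data and the names `forward` and `train_epoch` are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    # Each neuron's total potential is the weighted sum of its inputs plus a bias
    # (stored as the last weight); the activation function turns it into an output.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + ws[-1]) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1])
    return hidden, output

def train_epoch(data, w_hidden, w_out, lr=0.5):
    total_error = 0.0
    for x, target in data:
        hidden, output = forward(x, w_hidden, w_out)
        total_error += 0.5 * (output - target) ** 2
        # Error signal at the output layer, then propagated back to the hidden layer.
        delta_out = (output - target) * output * (1.0 - output)
        delta_hidden = [delta_out * w_out[j] * h * (1.0 - h) for j, h in enumerate(hidden)]
        # Gradient descent updates of the output and hidden weights.
        for j, h in enumerate(hidden):
            w_out[j] -= lr * delta_out * h
        w_out[-1] -= lr * delta_out
        for j, ws in enumerate(w_hidden):
            for i, xi in enumerate(x):
                ws[i] -= lr * delta_hidden[j] * xi
            ws[-1] -= lr * delta_hidden[j]
    return total_error

rng = random.Random(0)  # initial weights are assigned randomly
w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [rng.uniform(-1, 1) for _ in range(3)]
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # toy OR data, illustration only
errors = [train_epoch(data, w_hidden, w_out) for _ in range(2000)]
```

The list `errors` holds the summed squared error of each epoch, so comparing its first and last entries shows how far gradient descent has reduced the error.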

MULTISTAGE LUNG CANCER PREDICTION AND DETECTION USING MULTICLASS SVM CLASSIFIER.
Image enhancement is applied first to obtain lung images of better quality. The images are masked and passed through a selective median filter, which combines the median filter with an adaptive threshold for noise detection and is therefore more dependable for preprocessing and identification. Image segmentation is then used to detect cancer in CT images and to obtain better recognition of the regions in the image. The watershed lines are compared with the edges of the markers and are not influenced by lower-contrast edges, which solves the problem caused by local minima. Feature extraction reduces the huge amount of information in the image, which is a large collection of pixels, to a smaller set of values that describe whether the picture is normal or deviates from normal. The cancer nodules are classified with an SVM classifier, a machine learning algorithm that acts as an effective hyperplane classifier maximizing the margin. Cancer stage classification is based on the affected lung area and the total affected area.

SEX AND SMOKING STATUS EFFECTS ON THE EARLY DETECTION OF LUNG CANCER IN HIGH RISK SMOKERS USING AN ELECTRONIC NOSE.
Respiratory diseases such as asthma and chronic pulmonary disease can be identified by breath odor. This is possible because of the equilibrium between the air in the lungs and the pulmonary blood gas, which makes breath analysis useful for diagnosing diseases such as lung cancer. An electronic nose (e-nose) is an array of sensors with overlapping sensitivities to volatile organic compounds (VOCs). The sensors detect the VOCs through a chemical reaction and generate an electrical impulse. They are coated with a reactive compound whose response depends mainly on the chemical constituents of the sample, causing a measurable change in electrical resistance. The data are processed with pattern recognition techniques to identify a specific odor. The e-nose is capable of analysing non-invasive breath samples for chemicals in real time. The e-nose system contains 32 polymer sensors, each producing a unique pattern of electrical resistance. The e-nose has been used to detect lung cancer in advanced stages, and the study compares the effects of smoking in high-risk current smokers. The study used a cross-sectional case-control design with patients in whom lung cancer had been detected. The subjects were male and female, aged 45-79 years, selected on the basis of their history over past years. The "High risk smoker" and "Lung cancer" groups were compared in terms of sex. Cancer detection was based on low-dose CT (LDCT).

2.4-AUTOMATIC DETECTION OF ABNORMALITIES IN LUNG RADIOGRAPHS CAUSED BY PLANOCELLULAR LUNG CANCER.
An automatic algorithm is proposed for the early detection of planocellular lung cancer using lung X-ray images. Because lung cancer is often detected too late, early diagnosis relies on radiography as the diagnostic tool. The method compares an extracted planocellular lung cancer structure with the lung X-ray image under analysis by calculating coefficients and finding the maximum coefficient, which indicates the suspected cancer-affected area of the lung image. The results indicate that with the proposed algorithm lung cancer is predicted earlier. In this method, cancer that would otherwise be detected late is identified by matching the planocellular structure extracted beforehand against the image under analysis.

III. PROPOSED METHODOLOGY
The proposed system uses the following attributes for lung cancer detection: age, sex, Constrictive pericarditis (CP), the person's resting blood pressure (trestbps), Cholesterol, fasting blood sugar (FBS), the electrical activity of the heart at rest (Restecg), the person's maximum heart rate achieved (Thalach), peak weaves (exang), the slope of the ST segment (SLOPE; the J point, the point of inflection at the junction of the S wave and the ST segment, becomes depressed during exercise and the segment then slopes sharply upwards), Coronary calcium scan (CA) and Thalassemia (thal). The models are trained and tested on these attributes, and the target indicates whether a person is affected or not affected by lung cancer. The Random Forest, K-Nearest Neighbour (KNN), Support Vector Machine (SVM), Decision Tree, Logistic Regression and Naïve Bayes algorithms are applied to find the one with the highest accuracy, and the predictions of each are compared with the target.

1. PREPROCESSING TECHNIQUES
The dataset was obtained from Kaggle.com and contains 303 records. We import the NumPy and pandas libraries; pandas reads the data stored in CSV file format, and the matplotlib and seaborn libraries are used for visualization. Seaborn is used to plot the numerical values as graphs. Preprocessing also involves computing the percentages, the mean and the standard deviation of the values and then scaling them to unit variance, and it finds the average share of people affected and not affected. After scaling, about 70% of the values lie between -1 and 1. The target column is selected using pandas.
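
The scaling step can be illustrated with a short standard-library sketch. It assumes the scaling is ordinary z-score standardization with the population standard deviation (the convention that scikit-learn's StandardScaler also follows); the function name `standardize` and the list of ages are hypothetical and are not taken from the dataset.

```python
import statistics

def standardize(values):
    """Scale values to zero mean and unit variance (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical ages, used only for illustration
ages = [29, 41, 45, 52, 54, 57, 58, 61, 63, 77]
scaled = standardize(ages)

# Share of scaled values that fall between -1 and 1
within_one = sum(-1 <= z <= 1 for z in scaled) / len(scaled)
```

For roughly bell-shaped data the share `within_one` is expected to be in the region of two thirds, which is consistent with the figure of about 70% quoted above.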

CLASSIFICATION OF DATA
We import train_test_split from sklearn.model_selection. The train-test split technique is used to evaluate the performance of a machine learning algorithm. The dataset is first cleaned and then divided into two subsets, one for training and one for testing: of the 303 records, 203 are used for training and 100 for testing. The technique applies to both classification and regression problems.
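
What such a split does can be sketched without any library. The helper below is a simplified stand-in for train_test_split, not its actual implementation: the name `split_train_test`, the fixed seed and the use of integer placeholders instead of real records are assumptions made for the example.

```python
import random

def split_train_test(rows, test_size, seed=0):
    """Shuffle the rows and hold out `test_size` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

records = list(range(303))  # placeholders standing in for the 303 records
train, test = split_train_test(records, test_size=100)
```

Shuffling before the cut keeps the held-out records from depending on the order of the file, and fixing the seed makes the split repeatable between runs.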

K-NEAREST NEIGHBOUR ALGORITHM
K-nearest neighbours is one of the simplest machine learning algorithms and is based on the supervised learning technique. The algorithm assumes that a new case is similar to the available cases and assigns it to the category of the cases it most resembles. KNN uses 'feature similarity' to predict the values of new data points. It is also called a lazy learner algorithm. First the number of neighbours K is selected; then the Euclidean distance from the new point to the training points is calculated and the K nearest neighbours are taken. Among these K neighbours, the number of data points in each category is counted, and the new data point is assigned to the category with the maximum number of neighbours. Values of K up to 20 neighbours were tried, and the results are plotted as a graph.
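
The steps above can be sketched as follows. This is a minimal illustration with hypothetical two-dimensional points and the assumed function name `knn_predict`; it is not the scikit-learn classifier used in the experiments.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the training points by Euclidean distance to the query point.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Count the categories among the k nearest neighbours.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = [0, 0, 0, 1, 1, 1]

near_first_group = knn_predict(points, labels, (2, 2))
near_second_group = knn_predict(points, labels, (6, 5))
```

Because nothing is fitted in advance and all the work happens at prediction time, the sketch also shows why KNN is called a lazy learner.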

RANDOM FOREST ALGORITHM
Random forest is a supervised machine learning algorithm that is widely used for classification and regression problems. It builds decision trees on different samples of the data and takes their majority vote for classification and their average for regression. It can handle datasets containing continuous variables, as in regression, and categorical variables, as in classification. Random forest is an ensemble technique: ensemble means combining multiple models, so that a collection of models is used to make predictions rather than an individual model.
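
The ensemble idea can be illustrated with a deliberately simplified sketch: each "tree" is a one-threshold stump on a single feature, trained on a bootstrap sample, and the forest predicts by majority vote. A real random forest also grows deeper trees and samples features at random, so this is only an assumption-laden illustration; the names `fit_stump`, `fit_forest` and `predict_forest` and the data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(samples):
    """Pick the threshold that misclassifies the fewest samples (predict 1 if x > threshold)."""
    best_errors, best_threshold = None, None
    for threshold in sorted({x for x, _ in samples}):
        errors = sum((x > threshold) != bool(y) for x, y in samples)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold

def fit_forest(samples, n_trees=25, seed=0):
    rng = random.Random(seed)
    # Each stump is trained on a bootstrap sample (drawn with replacement).
    return [fit_stump(rng.choices(samples, k=len(samples))) for _ in range(n_trees)]

def predict_forest(thresholds, x):
    # Majority vote over the individual stumps.
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical one-feature data: (feature value, class)
samples = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
forest = fit_forest(samples)
low = predict_forest(forest, 0)
high = predict_forest(forest, 10)
```

An odd number of stumps is used so that a two-class vote cannot end in a tie.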

DECISION TREE ALGORITHM
The decision tree algorithm belongs to the family of supervised learning algorithms and can be used for solving both regression and classification problems. It builds a training model that predicts the class or value of the target variable by learning simple decision rules inferred from prior (training) data. To predict a class label for a record, we start from the root of the tree. The tree has two kinds of nodes, decision nodes and leaf nodes: decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions. It is a graphical representation for getting all the possible solutions to a problem based on given conditions.
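
How a single decision node might choose its rule can be sketched as below. The text does not state which splitting criterion was used, so Gini impurity is an assumption here (it is a common choice), and the function names and the age values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(values, labels):
    """Return the threshold with the lowest weighted impurity of the two branches."""
    best_threshold, best_score = None, float("inf")
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical feature values and target labels
ages = [35, 42, 48, 51, 58, 63, 67, 70]
target = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_split(ages, target)
```

A full tree repeats this search recursively on each branch until the leaves are pure or a depth limit is reached.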

LOGISTIC REGRESSION
Logistic regression is a supervised learning algorithm used to predict a categorical dependent (target) variable from a set of independent variables. The outcome can be either yes or no, true or false, etc., but instead of giving the exact value 0 or 1 it gives probabilistic values that lie between 0 and 1. Logistic regression is very similar to linear regression, but it fits an S-shaped logistic function whose two limiting values are 0 and 1. The curve of the logistic function indicates the likelihood of something, for example whether the cells are cancerous. Logistic regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete data.
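
The S-shaped function and the way a probability becomes a yes/no answer can be sketched as follows. The weights, bias and 0.5 threshold are hypothetical values chosen for illustration; they are not fitted to the dataset.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    # Linear combination of the features, squashed by the logistic function.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    return int(predict_proba(features, weights, bias) >= threshold)

# Hypothetical model with two features
weights, bias = [1.5, -2.0], 0.25
probability = predict_proba([2.0, 0.5], weights, bias)
label = predict_label([2.0, 0.5], weights, bias)
```

Here `probability` is the model's estimate that the record belongs to the positive class, and `label` is the decision obtained by comparing that estimate with the threshold.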

NAÏVE BAYES ALGORITHM
Naïve Bayes is a supervised learning algorithm based on Bayes' theorem and is used for solving classification problems. It is used in the classification of image and text data, which involve high-dimensional datasets. It is one of the simplest classification algorithms and helps in building fast machine learning models. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. The name is composed of 'Naïve' and 'Bayes': it is naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features, for example when an object is recognized on the basis of its color, shape and taste.
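
The independence assumption can be made concrete with a small categorical sketch using the color, shape and taste features mentioned above. The fruit data, the function names and the Laplace smoothing constant `alpha` are assumptions made for the example and do not come from the experiments.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)   # (class, feature index) -> value counts
    feature_values = defaultdict(set)       # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
            feature_values[i].add(value)
    return class_counts, feature_counts, feature_values

def predict_naive_bayes(model, row, alpha=1.0):
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Features are treated as independent, so their likelihoods are multiplied.
            seen = feature_counts[(label, i)][value]
            score *= (seen + alpha) / (count + alpha * len(feature_values[i]))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: (color, shape, taste) -> fruit
rows = [
    ("red", "round", "sweet"), ("red", "round", "sweet"), ("green", "round", "sour"),
    ("yellow", "long", "sweet"), ("yellow", "long", "sweet"), ("green", "long", "sweet"),
]
labels = ["apple", "apple", "apple", "banana", "banana", "banana"]
model = fit_naive_bayes(rows, labels)
first = predict_naive_bayes(model, ("red", "round", "sweet"))
second = predict_naive_bayes(model, ("yellow", "long", "sweet"))
```

The smoothing term keeps a feature value that was never seen with a class from forcing that class's score to zero.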

IV. RESULT
The project mainly focuses on detecting lung cancer using machine learning algorithms and on finding the highest accuracy level among logistic regression, random forest, naïve Bayes, k-nearest neighbour and decision tree. The attributes age, sex, cp, trestbps, chol, fbs, restecg, thalach and ca are used to measure the accuracy. The higher accuracy of the algorithm is

V. CONCLUSION
Lung cancer is a major cause of cancer-related deaths worldwide. About 60% of the patients diagnosed with lung cancer die after diagnosis. Although progress has been achieved in the molecular pathology of lung tumors and in targeted treatments, lung cancer diagnosis remains essential for the selection of appropriate curative and non-invasive procedures. Thorax computed tomography (CT) and positron emission tomography are used as non-invasive techniques. Machine learning techniques were used to process the raw data and provide a novel approach to lung cancer detection. However, the disease must be controlled at every stage and appropriate measures must be adopted. In future work, machine learning algorithms should be explored further as an effective way to predict lung cancer so that it can be diagnosed in similar ways, and more feature selection methods should be included in the process to predict lung cancer.