In this Jupyter Notebook, we will introduce the basic concepts of machine learning and walk through a few simple examples.
Machine learning is a subfield of artificial intelligence (AI) that involves the use of algorithms and statistical models to enable machines to improve their performance on a specific task through experience. In other words, machine learning algorithms can automatically learn and improve from data, without being explicitly programmed.
There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning involves training a machine learning model on a labeled dataset, where each input has a corresponding output. The goal of supervised learning is to learn a mapping from inputs to outputs that can generalize to new, unseen data.
Unsupervised learning involves training a machine learning model on an unlabeled dataset, where the goal is to discover patterns and structure in the data without any prior knowledge of what those patterns might be.
Reinforcement learning involves training a machine learning model to make decisions based on feedback from the environment. The goal of reinforcement learning is to learn a policy that maximizes a reward signal over time.
Figure 3.1. Iris versicolor. Source: Photo by Danielle Langlois, July 2005. (Image modified from the original by marking parts. “Iris versicolor 3.” Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons.)
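First, let's import the necessary libraries and load the Iris dataset, which contains 150 flowers from three species, each described by four measurements (sepal length, sepal width, petal length, and petal width):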
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
iris = load_iris()
X = iris.data
y = iris.target
The load_iris function loads the Iris dataset, while the train_test_split function is used to split the dataset into training and testing sets for supervised models. We will use the accuracy_score function from the same library to evaluate the performance of classifiers. We also import matplotlib.pyplot to create some plots.
Next, let's visualize the data by creating a scatter plot of the sepal length and sepal width:
# we'll use only the first two features for visualization purposes
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()
For supervised learning, we need to split the data into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
We use a test size of 20% and set the random state to ensure reproducibility.
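To make the idea of a shuffled 80/20 split concrete, here is a minimal pure-Python sketch. The helper name `simple_train_test_split` is hypothetical, and this is not how scikit-learn implements `train_test_split`; it will not select the same rows for the same seed. It only illustrates that the data is shuffled once and then cut into two disjoint parts.

```python
import random

def simple_train_test_split(items, test_size=0.2, seed=42):
    """Shuffle the indices, then hold out the first `test_size` fraction for testing.

    Hypothetical helper for illustration only; not scikit-learn's implementation.
    """
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    n_test = int(round(len(items) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

samples = list(range(150))  # stand-in for the 150 Iris rows
train, test = simple_train_test_split(samples)
print(len(train), len(test))  # 120 30
```

Fixing the seed plays the same role as `random_state=42` above: rerunning the cell yields the same partition.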
Model 1: K-Nearest Neighbours (KNN)
KNN is a simple machine learning algorithm used for classification and regression. Given a new, unlabeled observation, KNN predicts its class by finding the K closest labeled observations in the training set (based on some distance metric) and taking a majority vote among their labels. For example, if K=3 and the three closest observations have labels A, A, and B, the prediction for the new observation would be A.
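The majority-vote rule described above can be sketched in a few lines of plain Python. The function `knn_predict` and the toy points are assumptions made up for illustration; the sketch uses Euclidean distance and is not scikit-learn's implementation, which uses more efficient neighbor searches.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (hypothetical): two "A" points and one "B" point near the query.
train_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5), (5.0, 5.0)]
train_labels = ["A", "A", "B", "B"]
print(knn_predict(train_points, train_labels, (0.5, 0.5), k=3))  # A
```

Here the three nearest neighbors carry the labels A, A, and B, so the vote returns A, matching the example in the text.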
Let's create an instance of the KNeighborsClassifier class and fit the model to the training data:
from sklearn.neighbors import KNeighborsClassifier
k = 3
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train, y_train)
Finally, let's evaluate the performance of the model on the testing set:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 1.0
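As a quick illustration of what the accuracy metric measures, the following sketch computes the fraction of predictions that match the true labels. The function name `accuracy` and the label lists are hypothetical; for classification labels like these, `accuracy_score` computes the same ratio by default.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: 4 of the 5 predictions are correct.
print(accuracy([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))  # 0.8
```

With only 30 test samples, as in the Iris split above, each misclassified flower changes the accuracy by about 3.3 percentage points, so a score of 1.0 should be read with that granularity in mind.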
Model 2: Logistic Regression
Logistic regression is usually used for binary classification. Given an input observation, logistic regression outputs the probability of the positive class (i.e., the class with a label of 1) as a function of the input features. The probability is modeled using the logistic function (also known as the sigmoid function), which maps any input value to a value between 0 and 1.
Logistic regression can be extended to handle multi-class classification problems using one-vs-all (OvA) or softmax regression. In OvA, separate binary logistic regression models are trained for each class, and the class with the highest probability is predicted for a given observation. In softmax regression, a single model is trained to output a probability distribution over all possible classes.
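The two functions mentioned above are short enough to write out directly. This is a minimal sketch with hypothetical scores, not a trained model: `sigmoid` maps one score to a probability, while `softmax` maps a list of per-class scores to a probability distribution.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to the interval (0, 1).

    Simple form for illustration; very large negative z would overflow math.exp.
    """
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))  # 0.5
probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(probs, sum(probs))
```

The class with the largest score always receives the largest softmax probability, so predicting the most probable class is the same as predicting the class with the highest score.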
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 1.0
Let's plot our predictions:
# color each test point by the class predicted by the logistic regression model
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Predicted Iris Types')
plt.show()
Model 3: K-Means Clustering
K-means clustering is a popular unsupervised machine learning algorithm used for clustering data into K groups or clusters. Given a dataset, the goal of K-means is to find K cluster centers that minimize the sum of squared distances between each data point and its assigned cluster center.
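One common way to approach this objective is to alternate between assigning points to their nearest center and moving each center to the mean of its points (often called Lloyd's algorithm). The sketch below is an assumption-laden illustration: the helper names, the toy points, and the hand-picked starting centers are all hypothetical, and scikit-learn's KMeans adds smarter initialization and convergence checks.

```python
import math

def assign(points, centers):
    """Return the index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]

def update(points, labels, centers):
    """Move each center to the mean of its assigned points (unchanged if empty)."""
    new_centers = []
    for j, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old_center)
    return new_centers

def inertia(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[label]) ** 2 for p, label in zip(points, labels))

# Toy data (hypothetical): two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]  # hand-picked starting centers
for _ in range(5):  # a fixed number of iterations instead of a convergence test
    labels = assign(points, centers)
    centers = update(points, labels, centers)
print(labels)
print(centers, inertia(points, labels, centers))
```

Neither step can increase the sum of squared distances, but the procedure may stop at a local minimum, which is why the result can depend on the starting centers.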
from sklearn.cluster import KMeans
# create a KMeans model and fit it to the data
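# n_init and random_state are set explicitly so that the clustering is reproducible
model = KMeans(n_clusters=3, n_init=10, random_state=42)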
model.fit(iris.data)
# use the model to make predictions on the data
predictions = model.predict(iris.data)
Let's visualise the data with cluster centers overlaid:
plt.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap='viridis')
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], marker='x', s=100, linewidths=3, color='red')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('K-Means Clustering with K=3 on Iris Dataset')
plt.show()
The breast cancer dataset contains information about breast cancer tumors and their diagnosis as either malignant or benign. The data was collected by the University of Wisconsin Hospitals, Madison and consists of 569 instances, each with 30 numeric features representing measurements of the tumors. The target variable is a binary label indicating whether the tumor is malignant (coded as 0) or benign (coded as 1).
The features in the dataset represent various characteristics of the tumors, including their size, shape, texture, and other characteristics measured from digital images of the tumors. Some examples of the features include radius, texture, perimeter, area, smoothness, compactness, concavity, and symmetry.
Let's start again by importing the required libraries and loading the dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import numpy as np
# load the breast cancer dataset
data = load_breast_cancer()
# take a look at the data
print(data['data'].shape)
print(data['target'][:10])
print(data['feature_names'])
(569, 30)
[0 0 0 0 0 0 0 0 0 0]
['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension']
# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# create a RidgeClassifier model and fit it to the training data
model = RidgeClassifier(alpha=1.0)
model.fit(X_train, y_train)
# use the model to make predictions on the testing data
predictions = model.predict(X_test)
# calculate the accuracy of the model
accuracy = accuracy_score(y_test, predictions)
print('Accuracy:', accuracy)
Accuracy: 0.956140350877193
In this example, we created a RidgeClassifier model with a regularization parameter (alpha) of 1.0 and fit it to the training data. We used the model to make predictions on the testing data and calculated the accuracy of the model.
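To inspect the regularized model, we plot its coefficients using plt.plot(model.coef_.ravel()). Because the features are not standardized, the size of each coefficient also reflects the scale of its feature, so the plot should not be read as a ranking of feature importance.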
plt.figure(figsize=(15, 4))
plt.plot(model.coef_.ravel())
plt.xticks(np.arange(len(data.feature_names)), data.feature_names, rotation=90)
plt.xlabel('Features')
plt.ylabel('Coefficient Value')
plt.title('RidgeClassifier Coefficients')
plt.show()
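To see how the regularization parameter shrinks coefficients, consider the simplest possible case: ridge regression with a single feature and no intercept, where the penalized least-squares weight has the closed form w = Σxy / (Σx² + alpha). The sketch below uses hypothetical toy data and a hypothetical helper `ridge_weight_1d`; it illustrates the penalty only and is not the RidgeClassifier fit above, which works with 30 features and an intercept.

```python
def ridge_weight_1d(xs, ys, alpha):
    """Minimizer of sum((y - w*x)^2) + alpha * w^2 for one feature, no intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)

# Toy data (hypothetical) generated exactly by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for alpha in (0.0, 1.0, 10.0, 100.0):
    # Larger alpha pulls the weight further from 2.0 toward 0.
    print(alpha, ridge_weight_1d(xs, ys, alpha))
```

With alpha=0 the weight equals the unregularized slope of 2.0; increasing alpha trades some fit on the training data for smaller coefficients.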
Cross-validation is a technique for estimating how well a model generalizes, and it is especially useful when the dataset is too small to set aside a large test set. The basic idea is to divide the dataset into multiple folds (subsets), train the model on some of the folds, test it on the held-out fold(s), and repeat the process so that every fold is used for testing.
One of the most commonly used variants is k-fold cross-validation. The dataset is divided into k folds of approximately equal size. The model is trained on k-1 of the folds and tested on the remaining fold, and the process is repeated k times so that each fold serves as the test set exactly once. The k scores are then averaged to produce an estimate of the model's performance.
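The fold bookkeeping can be sketched without any library. The generator `k_fold_indices` below is a hypothetical helper that splits indices into contiguous folds without shuffling; scikit-learn's KFold offers the same idea with optional shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # The first n_samples % k folds get one extra sample so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

for train_idx, test_idx in k_fold_indices(10, 5):
    print("train:", train_idx, "test:", test_idx)
```

Each of the 10 indices appears in exactly one test fold and in the training set of the other four iterations.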
The Boston Housing dataset contains information about housing prices in Boston, Massachusetts, and consists of 506 instances, each with 13 features representing various attributes of the homes and their neighborhoods. The target variable is the median value of owner-occupied homes in thousands of dollars.
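Here's an example of using k-fold cross-validation in scikit-learn to evaluate the performance of a linear regression model on the Boston Housing dataset. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code below requires an older release of the library.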
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
# load the Boston Housing dataset
boston = load_boston()
# create a linear regression model
model = LinearRegression()
# perform k-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, boston.data, boston.target, cv=kf, scoring='neg_mean_squared_error')
# print the mean squared error
print(f"Mean Squared Error: {-scores.mean()}")
Mean Squared Error: 23.488595677968725
In this example, we loaded the Boston Housing dataset using load_boston and created a linear regression model using LinearRegression. We then evaluated the model with k-fold cross-validation, using 5 folds (n_splits=5) and shuffling the data before splitting (shuffle=True) so that the folds do not depend on the order of the rows in the dataset. We passed scoring='neg_mean_squared_error' because scikit-learn scorers follow a higher-is-better convention, so the mean squared error is reported with its sign flipped; negating the mean of the scores recovers the usual positive value. Finally, we printed the mean squared error averaged across all folds as an estimate of the model's performance.
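The sign handling is easy to get wrong, so here is a small sketch of the arithmetic with made-up numbers. The per-fold targets and predictions are hypothetical and unrelated to the Boston result above; the point is only that each fold's score is the negated MSE, and negating the average gives back a positive error.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical (targets, predictions) pairs for two folds.
fold_results = [
    ([3.0, 5.0], [2.0, 5.0]),      # MSE = (1 + 0) / 2 = 0.5
    ([10.0, 12.0], [11.0, 14.0]),  # MSE = (1 + 4) / 2 = 2.5
]

# Higher-is-better scores: each one is the negated MSE of its fold.
scores = [-mean_squared_error(t, p) for t, p in fold_results]
print(scores)                      # [-0.5, -2.5]
print(-sum(scores) / len(scores))  # 1.5
```

Because MSE is in squared units of the target, taking its square root gives an error in the target's own units, which can be easier to interpret.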