Project 4: Email Spam Detection
Introduction
In this project, we will use classification algorithms such as Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), and Decision Trees to build a spam detection system. The goal is to classify emails as either 'spam' or 'not spam' based on their content.
Understanding Email Spam Detection
Email spam detection is a crucial application in the field of machine learning and natural language processing (NLP). Spam emails are unsolicited messages, typically sent in bulk to promote products or services. A successful spam filter needs to learn from past data and accurately predict whether new emails are spam.
Dataset
For this project, we will use the Enron Email Dataset, which is publicly available and contains a large number of emails from employees of the Enron Corporation. Each email is labeled as 'spam' or 'not spam.' For our purposes, we'll focus on the content of the emails, which includes the subject lines and the body text.
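Before preprocessing, it helps to load the data and check how the two classes are distributed. The snippet below is a minimal sketch; the column names 'subject', 'body', and 'label' are assumptions about how the CSV is laid out and should be adapted to your copy of the dataset.
```python
import pandas as pd

# Load the raw dataset (hypothetical path and column names -- adjust as needed).
emails = pd.read_csv('emails.csv')

# Combine the subject line and body into a single text field for later preprocessing.
emails['text'] = emails['subject'].fillna('') + ' ' + emails['body'].fillna('')

# Check the class balance between 'spam' and 'not spam'.
print(emails['label'].value_counts())
```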
Data Preprocessing
Before we start building our models, we need to preprocess the data:
1. Text Cleaning: Remove punctuation and special characters, and convert the text to lowercase.
2. Tokenization: Split the text into individual words or tokens.
3. Stop Word Removal: Remove common words that do not contribute to the meaning (e.g., 'the', 'is', 'in').
4. Vectorization: Convert the text into a numerical format using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
Example Code for Data Preprocessing
```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the dataset (replace with the actual path to your dataset)
emails = pd.read_csv('emails.csv')

# Basic text cleaning function: lowercase the text and strip punctuation/special characters
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text

# Apply cleaning
emails['cleaned'] = emails['text'].apply(clean_text)

# Vectorization using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails['cleaned'])
Y = emails['label']  # Assuming 'label' contains 'spam' or 'not spam'

# Train-Test Split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
```
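Note that TfidfVectorizer handles tokenization (step 2) internally, and it can also take care of stop word removal (step 3). The variant below is a minimal sketch using scikit-learn's built-in English stop word list; whether that list suits your data is an assumption worth checking.
```python
# Variant: let the vectorizer drop English stop words while tokenizing.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails['cleaned'])
```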
Model Training
We will implement three classification algorithms: SVM, k-NN, and Decision Trees.
1. Support Vector Machine (SVM)
SVM is effective in high-dimensional spaces and is particularly useful for text classification tasks.
```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Train the model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, Y_train)

# Predictions
svm_predictions = svm_model.predict(X_test)
print(classification_report(Y_test, svm_predictions))
```
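Because the TF-IDF features are sparse and high-dimensional, a linear kernel is usually a good default. As an optional alternative, the sketch below uses scikit-learn's LinearSVC, which fits the same kind of linear decision boundary with a solver that tends to scale better to large text datasets; treat it as a drop-in variant rather than part of the original recipe.
```python
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# LinearSVC: a linear SVM with a solver suited to large, sparse feature matrices.
linear_svm = LinearSVC()
linear_svm.fit(X_train, Y_train)
print(classification_report(Y_test, linear_svm.predict(X_test)))
```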
2. k-Nearest Neighbors (k-NN)
k-NN is a simple algorithm that classifies a data point based on the classes of its nearest neighbors.
```python
from sklearn.neighbors import KNeighborsClassifier

# Train the model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, Y_train)

# Predictions
knn_predictions = knn_model.predict(X_test)
print(classification_report(Y_test, knn_predictions))
```
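The choice of n_neighbors has a large effect on k-NN's behavior: small values react to noise, while large values blur class boundaries. The sketch below tunes k with cross-validation via GridSearchCV; the candidate values and the macro-F1 scoring choice are assumptions you may want to adjust.
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search a few candidate values of k using 5-fold cross-validation on the training set.
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='f1_macro')
grid.fit(X_train, Y_train)

print('Best k:', grid.best_params_['n_neighbors'])
print('Best cross-validated macro-F1:', grid.best_score_)
```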
3. Decision Trees
Decision Trees create a model that predicts the value of a target variable based on several input variables.
```python
from sklearn.tree import DecisionTreeClassifier

# Train the model
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, Y_train)

# Predictions
tree_predictions = tree_model.predict(X_test)
print(classification_report(Y_test, tree_predictions))
```
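With all three models trained, it is natural to compare them on the held-out test set. The sketch below reuses the prediction variables from the blocks above and reports accuracy as a single summary number; the per-class detail in classification_report remains the more informative view if the classes are imbalanced.
```python
from sklearn.metrics import accuracy_score

# Compare the three classifiers on the same test split.
results = {
    'SVM': svm_predictions,
    'k-NN': knn_predictions,
    'Decision Tree': tree_predictions,
}
for name, predictions in results.items():
    print(f'{name} accuracy: {accuracy_score(Y_test, predictions):.3f}')
```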