Article Id – unique id given to each record
Article – text of the headline and article body
Category – category of the article (tech, business, sport, entertainment, politics)
Consider the BBC News dataset as the corpus for implementing question 1.
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Load the BBC News dataset (pandas infers the zip compression from the file extension)
df = pd.read_csv(r"C:\Users\ahlad\Downloads\bbc-text.csv.zip")
df
| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |
| ... | ... | ... |
| 2220 | business | cars pull down us retail figures us retail sal... |
| 2221 | politics | kilroy unveils immigration policy ex-chatshow ... |
| 2222 | entertainment | rem announce new glasgow concert us band rem h... |
| 2223 | politics | how political squabbles snowball it s become c... |
| 2224 | sport | souness delight at euro progress boss graeme s... |

2225 rows × 2 columns
# Preprocess the text data
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
df['text'] = df['text'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split() if word.lower() not in stop_words]))
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ahlad\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['category'], test_size=0.2, random_state=42)
# Convert the text data into numerical form using TF-IDF vectorization
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
# Train a Naive Bayes classifier on the training data
clf = MultinomialNB()
clf.fit(X_train, y_train)
MultinomialNB()
# Evaluate the performance of the classifier on the testing data
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.9573033707865168
Precision: 0.9588842437695448
Recall: 0.9573033707865168
F1 Score: 0.9571039269250978
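The `average='weighted'` scores above collapse the per-class results into single numbers. For a per-category breakdown, `classification_report` prints precision, recall, and F1 for each label; a minimal sketch on invented labels (not the BBC predictions):

```python
from sklearn.metrics import classification_report

# Toy ground-truth labels and predictions, for illustration only
y_true = ["sport", "tech", "sport", "business", "tech"]
y_hat  = ["sport", "tech", "tech", "business", "tech"]

report = classification_report(y_true, y_hat)
print(report)  # one row of precision/recall/F1 per category
```

Running the same call with `y_test` and `y_pred` from the cells above would show which of the five news categories the classifier confuses most often.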