
Hello again! Supervised machine learning basically has two types of problems: classification problems and regression problems. Classification deals with problems where the output belongs to a fixed set of categories, so the target is represented by discrete values (whole numbers). Regression, on the other hand, deals with targets that are continuous in nature, represented by whole numbers or fractions.
With that said, and as per the title, this post is mainly concerned with classification problems. It is a beginner-level exercise that gives you a starting point to explore more. We will use Multinomial Naive Bayes for a very basic classification problem: deciding whether a given item is a vegetable or a fruit. The dataset is prepared by myself, since ready-made data doesn't seem as good to me for learning.
The steps involved are: step 1, load the data; step 2, process the data; step 3, split the data into train and test sets; step 4, import the required classification library and fit the classifier; step 5, analyse the result with accuracy for any tuning; and step 6, predict on new data.
Let's talk a bit about the process-data step. In this classification problem we will use the last letter of the item name to predict what the item is. For example: carrot -> t maps to vegetable, mango -> o maps to fruit. This can be enhanced in whatever way is required, but to keep it simple I will go with just the last letter.
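As a tiny illustration of the feature (nothing model related yet, just Python string indexing):
print("carrot"[-1])  # t, which we will map to vegetable
print("mango"[-1])   # o, which we will map to fruit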
Time for coding. Step 1:
#Step 1
import pandas as pd
import random

# read the two CSV files and tag each item with its label
veg = pd.read_csv("vegetables.csv")
frt = pd.read_csv("fruits.csv")
vegetables = [(v[0], "vegetable") for v in veg.values]
fruits = [(f[0], "fruit") for f in frt.values]

# combine and shuffle so the classes are mixed
items = vegetables + fruits
random.shuffle(items)
Here we use pandas to read the individual CSV files and then shuffle the combined list so the items end up in random order.
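Just to make the structure concrete, each entry in items is a (name, label) pair; here is an optional peek (the actual names printed depend on your own CSV files):
# e.g. ('carrot', 'vegetable'), ('mango', 'fruit'), ... in shuffled order
for name, label in items[:5]:
    print(name, label)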
#Step 2
import numpy as np
from sklearn import preprocessing

items = np.array(items)
# keep only the last letter of each item name as the feature
X = [item[-1] for item in items[:, 0]]
# one-hot encode the letters so the classifier gets numerical input
lb = preprocessing.LabelBinarizer()
lb.fit(X)
X2 = lb.transform(X)
# map vegetables to 0 and fruits to 1
y = np.where(items[:, 1] == "vegetable", 0, 1)
Here we are preprocessing the data. We consider only the last letter of each name to decide whether the item is a fruit or a vegetable, use LabelBinarizer to convert the letters into numerical (one-hot) vectors, and map vegetables to 0 and fruits to 1.
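To see what the encoding looks like, here is a small optional check (it assumes the letter 't', as in carrot, occurs in your data):
# the distinct last letters found during fit
print(lb.classes_)
# one-hot row for 't': a 1 in the column for 't', zeros everywhere else
print(lb.transform(["t"]))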
#Step 3
from sklearn.model_selection import train_test_split
# hold out 40% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.4, random_state=42)
This splits the data into train and test sets. The train set is used to fit the classifier, which then predicts on the test set, and we later measure the accuracy against the actual output values.
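A quick optional check on the split sizes; train_test_split also accepts a stratify argument if you want the fruit/vegetable ratio kept similar in both halves:
# roughly a 60/40 split of the rows, as set by test_size=0.4
print(X_train.shape, X_test.shape)
# optional variant that keeps the class balance similar in train and test:
# X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.4, random_state=42, stratify=y)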
#Step 4
from sklearn.naive_bayes import MultinomialNB

# alpha is the additive (Laplace/Lidstone) smoothing parameter
clf = MultinomialNB(alpha=0.1, fit_prior=True)
clf.fit(X_train, y_train)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
MultinomialNB is used here as the classifier. Multinomial Naive Bayes accepts only numerical features, which is why we converted the letters to numerical values.
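Besides the hard 0/1 predictions, MultinomialNB can also report class probabilities, which is handy for spotting borderline items (an optional peek, not part of the original steps):
# columns follow clf.classes_, i.e. column 0 = vegetable, column 1 = fruit
print(clf.predict_proba(X_test[:5]))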
#Step 5
from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(y_train, y_train_pred))
print(accuracy_score(y_test, y_test_pred))
CC = confusion_matrix(y_test, y_test_pred)
print(CC)
Accuracy is a good thing to know before we start using any classifier. The confusion matrix gives you an intuition of your classifier: for how many cases it has predicted right and for how many it has predicted wrong. For a very good classifier the diagonal elements should dominate.
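If the raw matrix is hard to read, one optional way to label it is to wrap it in a DataFrame, reusing pandas from Step 1 (the row and column names below are just my own labels):
# rows = actual class, columns = predicted class (0 = vegetable, 1 = fruit)
import pandas as pd
print(pd.DataFrame(CC, index=["actual vegetable", "actual fruit"], columns=["pred vegetable", "pred fruit"]))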
#Step 6
#Client
predict_item = input("Enter your item : ")
#predict_item = "mango"
# take the last letter, encode it, and predict with the trained model
letter = predict_item[-1]
labelled_letter = lb.transform([letter])
print(labelled_letter)
print(clf.predict(labelled_letter))
Finally, the client code. Once all of steps 1 to 5 above are done, the classifier is ready to predict on new data.
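If you want something slightly more defensive than the raw client code, here is a small optional wrapper (predict_item_type is my own name, not from the steps above). It warns when the last letter never appeared in training, because LabelBinarizer then encodes it as an all-zero row and the prediction leans mostly on the learned class priors:
def predict_item_type(name):
    letter = name.strip()[-1]
    if letter not in lb.classes_:
        print("Note: last letter '%s' was not in the training data" % letter)
    encoded = lb.transform([letter])
    return "vegetable" if clf.predict(encoded)[0] == 0 else "fruit"

print(predict_item_type("mango"))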
The code can be found on the GitHub link. The steps will be similar to the above and can be tried in many different applications, like classifying names as male or female. There is a dataset in the NLTK library that can be downloaded, and then you can follow the same steps. Claps are much appreciated. Suggestions are welcome.
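P.S. If you want to try the NLTK names data mentioned above, here is a minimal sketch of loading it (it assumes nltk is installed; the corpus ships as male.txt and female.txt). From there, Steps 2 to 6 apply with these (name, label) pairs in place of items:
import nltk
nltk.download("names")
from nltk.corpus import names
male = [(n, "male") for n in names.words("male.txt")]
female = [(n, "female") for n in names.words("female.txt")]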