
Python Implementation of Naive Bayes for Text Classification

Naive Bayes Estimation

Naive Bayes is a classification method based on Bayes' theorem and the assumption of conditional independence between features. Given a training data set, the joint probability distribution of the input and output is first learned under the feature conditional independence assumption; then, for a given input x, Bayes' theorem is used to find the output y with the highest posterior probability.
Specifically, from the training data set, learn the maximum likelihood estimate of the prior probability:
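$$
P(Y=c_k) = \frac{\sum\limits_{i=1}^{N} I(y_i=c_k)}{N},\qquad k=1,2,\dots,K
$$

Here $I(\cdot)$ is the indicator function and $N$ is the number of training samples.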

and the conditional probability is
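$$
P(X=x \mid Y=c_k) = P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k)
$$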

$X^{(l)}$ denotes the l-th feature. By the assumption of feature conditional independence, we get
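$$
P(X=x \mid Y=c_k) = \prod\limits_{l=1}^{n} P(X^{(l)}=x^{(l)} \mid Y=c_k)
$$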

The maximum likelihood estimation of conditional probability is
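$$
P(X^{(l)}=a_{lj} \mid Y=c_k) = \frac{\sum\limits_{i=1}^{N} I(x_i^{(l)}=a_{lj},\, y_i=c_k)}{\sum\limits_{i=1}^{N} I(y_i=c_k)}
$$

where $a_{lj}$ is the j-th possible value of the l-th feature.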

According to Bayes' theorem
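$$
P(Y=c_k \mid X=x) = \frac{P(X=x \mid Y=c_k)\,P(Y=c_k)}{\sum\limits_{k} P(X=x \mid Y=c_k)\,P(Y=c_k)}
$$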

Then, from the above formulas, the posterior probability P(Y=c_k|X=x) is obtained, and the class with the largest posterior probability is taken as the output.

Bayesian estimation

Maximum likelihood estimation may produce probability estimates equal to 0, which distorts the computed posterior probabilities and biases the classification. The following method is adopted to solve this.
The Bayesian estimation of conditional probability is changed to
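$$
P(X^{(l)}=a_{lj} \mid Y=c_k) = \frac{\sum\limits_{i=1}^{N} I(x_i^{(l)}=a_{lj},\, y_i=c_k)+\lambda}{\sum\limits_{i=1}^{N} I(y_i=c_k)+S_l\lambda}
$$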

where $S_l$ represents the number of possible values of the l-th feature.
Similarly, the Bayesian estimation of the prior probability is changed to

$$
P(Y=c_k) = \frac{\sum\limits_{i=1}^{N} I(y_i=c_k)+\lambda}{N+K\lambda}
$$

where $K$ represents the number of possible values of Y, that is, the number of classes.
Specifically, with $\lambda = 1$ (Laplace smoothing), the count of each possible value is initialized to 1, so that every possible value is treated as having appeared once and no estimate is 0.
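For example, with $\lambda = 1$, a feature value that never co-occurs with class $c_k$ among 10 training samples of that class, where the feature has $S_l = 2$ possible values, gets the estimate $(0+1)/(10+2\cdot 1)=1/12$ rather than $0$.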

Text classification

The Naive Bayes classifier outputs the most likely class together with its estimated probability, and is commonly used for text classification.
The core idea of classification is to select the category with the highest probability. The Bayesian formula is as follows:
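With $\mathbf{w}$ denoting the word-count vector of a document and $c_i$ a class, Bayes' theorem reads

$$
p(c_i \mid \mathbf{w}) = \frac{p(\mathbf{w} \mid c_i)\,p(c_i)}{p(\mathbf{w})}
$$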

Here, the number of times each word appears in a document is taken as a feature.
Assuming each feature is independent, that is, each word occurs independently of the others, then
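$$
p(\mathbf{w} \mid c_i) = \prod\limits_{j} p(w_j \mid c_i)
$$

In the code below, trainNB0 estimates these per-word probabilities (with Laplace-style smoothing), and classifyNB sums their logarithms so that the product of many small probabilities does not underflow.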

The complete code is as follows:

import numpy as np
import re
import feedparser
import operator
def loadDataSet():
 postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
     ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
     ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
     ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
     ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
     ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
 classVec = [0,1,0,1,0,1]  #1 is abusive, 0 not
 return postingList, classVec
def createVocabList(data): #Create vocabulary vector
 returnList = set([])
 for subdata in data:
  returnList = returnList | set(subdata)
 return list(returnList)
def setofWords2Vec(vocabList, data):  #Convert a document into a word-count vector over the vocabulary
 returnList = [0]*len(vocabList)
 for vocab in data:
  if vocab in vocabList:
   returnList[vocabList.index(vocab)] += 1
 return returnList
def trainNB0(trainMatrix,trainCategory):  #Training: estimate class prior and per-word probabilities
 pAbusive = sum(trainCategory)/len(trainCategory)  #prior probability of class 1
 p1num = np.ones(len(trainMatrix[0]))  #word counts start at 1 (Laplace-style smoothing)
 p0num = np.ones(len(trainMatrix[0]))
 p1Denom = 2  #denominators start at 2 for the same reason
 p0Denom = 2
 for i in range(len(trainCategory)):
  if trainCategory[i] == 1:
   p1num = p1num + trainMatrix[i]
   p1Denom = p1Denom + sum(trainMatrix[i])
  else:
   p0num = p0num + trainMatrix[i]
   p0Denom = p0Denom + sum(trainMatrix[i])
 p1Vect = np.log(p1num/p1Denom)  #log probabilities: products become sums and do not underflow
 p0Vect = np.log(p0num/p0Denom)
 return p0Vect, p1Vect, pAbusive
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1): #Classification
 p0 = sum(vec2Classify*p0Vec)+np.log(1-pClass1)
 p1 = sum(vec2Classify*p1Vec)+np.log(pClass1)
 if p1 > p0:
  return 1
 else:
  return 0

def textParse(bigString):  #Split raw text into lowercase tokens longer than two characters
 splitdata = re.split(r'\W+', bigString)
 splitdata = [token.lower() for token in splitdata if len(token) > 2]
 return splitdata
def spamTest():
 docList = []
 classList = []
 for i in range(1,26):
  with open('spam/%d.txt'%i) as f:
   doc = f.read()
  docList.append(textParse(doc))
  classList.append(1)
  with open('ham/%d.txt'%i) as f:
   doc = f.read()
  docList.append(textParse(doc))
  classList.append(0)
 vocalList = createVocabList(docList)
 trainList = list(range(50))
 testList = []
 for i in range(13):
  num = int(np.random.uniform(0, len(trainList)))
  testList.append(trainList[num])
  del(trainList[num])
 docMatrix = []
 docClass = []
 for i in trainList:
  subVec = setofWords2Vec(vocalList, docList[i])
  docMatrix.append(subVec)
  docClass.append(classList[i])
 p0v, p1v, pAb = trainNB0(docMatrix, docClass)
 errorCount = 0
 for i in testList:
  subVec = setofWords2Vec(vocalList, docList[i])
  if classList[i] != classifyNB(subVec, p0v, p1v, pAb):
   errorCount += 1
 return errorCount/len(testList)
def calcMostFreq(vocabList, fullText):
 count = {}
 for vocab in vocabList:
  count[vocab] = fullText.count(vocab)
 sortedFreq = sorted(count.items(), key=operator.itemgetter(1), reverse=True)
 return sortedFreq[:30]
def localWords(feed1,feed0):
 docList = []
 classList = []
 fullText = []
 numList = min(len(feed1['entries']),len(feed0['entries']))
 for i in range(numList):
  doc1 = textParse(feed1['entries'][i]['summary'])
  docList.append(doc1)
  classList.append(1)
  fullText.extend(doc1)
  doc0 = textParse(feed0['entries'][i]['summary'])
  docList.append(doc0)
  classList.append(0)
  fullText.extend(doc0)
 vocabList = createVocabList(docList)
 top30Words = calcMostFreq(vocabList,fullText)
 for word in top30Words:
  if word[0] in vocabList:
   vocabList.remove(word[0])
 trainingSet = list(range(2*numList))
 testSet = []
 for i in range(20):
  randnum = int(np.random.uniform(0,len(trainingSet)))
  testSet.append(trainingSet[randnum])
  del(trainingSet[randnum])
 trainMat = []
 trainClass = []
 for i in trainingSet:
  trainClass.append(classList[i])
   trainMat.append(setofWords2Vec(vocabList,docList[i]))
 p0V,p1V,pSpam = trainNB0(trainMat,trainClass)
 errCount = 0
 for i in testSet:
  testData = setofWords2Vec(vocabList,docList[i])
  if classList[i] != classifyNB(testData,p0V,p1V,pSpam):
   errCount += 1
 return errCount/len(testSet)
if __name__=="__main__":
 ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
 sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
 print(localWords(ny,sf))
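
As a quick sanity check, here is a minimal usage sketch (variable names are illustrative, and it assumes the functions above are defined in the same file) that trains on the toy posting data and classifies a new document:

posts, labels = loadDataSet()
vocab = createVocabList(posts)
trainMat = [setofWords2Vec(vocab, doc) for doc in posts]
p0V, p1V, pAb = trainNB0(trainMat, labels)
testDoc = setofWords2Vec(vocab, ['stupid', 'garbage'])
print(classifyNB(testDoc, p0V, p1V, pAb))  #expected output: 1 (abusive) for this input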

Programming Tips:

1.Union of two sets

vocab = vocab | set(document)

2.Create a vector with all elements set to zero

vec = [0]*10

Code and dataset download: Bayesian

That's all for this article. I hope it is helpful to your learning, and I hope you will continue to support the Yelling Tutorial.

