Naive Bayes Estimation
Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent given the class. First, the joint probability distribution of the input and output is learned from the training data under this assumption; then, for a given input x, Bayes' theorem is used to find the output y with the highest posterior probability.
Specifically, from the training data set, the maximum likelihood estimate of the prior probability is

$$
P(Y=c_k) = \frac{\sum\limits_{i=1}^{N} I(y_i=c_k)}{N}, \quad k=1,2,\ldots,K
$$
and the conditional probability is

$$
P(X=x|Y=c_k) = P(X^{(1)}=x^{(1)},\ldots,X^{(n)}=x^{(n)}|Y=c_k)
$$
$X^{(l)}$ represents the $l$-th feature. Due to the assumption of feature conditional independence, we get

$$
P(X=x|Y=c_k) = \prod\limits_{l=1}^{n} P(X^{(l)}=x^{(l)}|Y=c_k)
$$
The maximum likelihood estimate of the conditional probability is

$$
P(X^{(l)}=a_{lj}|Y=c_k) = \frac{\sum\limits_{i=1}^{N} I(x_i^{(l)}=a_{lj},\, y_i=c_k)}{\sum\limits_{i=1}^{N} I(y_i=c_k)}
$$

where $a_{lj}$ denotes the $j$-th possible value of the $l$-th feature.
According to Bayes' theorem,

$$
P(Y=c_k|X=x) = \frac{P(X=x|Y=c_k)P(Y=c_k)}{\sum\limits_{k} P(X=x|Y=c_k)P(Y=c_k)}
$$
Substituting the estimates above into this formula gives the posterior probability $P(Y=c_k|X=x)$, and the class with the largest posterior is taken as the prediction.
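As a quick illustration of these formulas, here is a minimal sketch (the toy dataset, variable names, and query point are invented for this example) that computes the prior and conditional MLEs by counting, then scores a query point under the independence assumption:

import numpy as np

# Toy training data: rows are (x1, x2), labels y in {0, 1} -- invented for illustration
X = np.array([[1, 'S'], [1, 'M'], [1, 'M'], [2, 'S'], [2, 'L'], [2, 'L']], dtype=object)
y = np.array([0, 0, 1, 1, 1, 0])
N = len(y)

classes = np.unique(y)
prior = {c: np.sum(y == c) / N for c in classes}   # P(Y=c_k) by counting

def cond_prob(l, value, c):                        # MLE of P(X^(l)=value | Y=c)
    mask = (y == c)
    return np.sum(X[mask, l] == value) / np.sum(mask)

query = [2, 'S']
for c in classes:
    score = prior[c]
    for l, v in enumerate(query):
        score *= cond_prob(l, v, c)                # conditional independence: multiply per-feature terms
    print(c, score)                                # predict the class with the largest score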
Bayesian estimation
Maximum likelihood estimation may produce probability estimates of 0 (when a feature value never co-occurs with a class in the training data), which distorts the posterior probability calculation and biases the classification. Bayesian estimation, described below, solves this.
The Bayesian estimate of the conditional probability is changed to

$$
P_\lambda(X^{(l)}=a_{lj}|Y=c_k) = \frac{\sum\limits_{i=1}^{N} I(x_i^{(l)}=a_{lj},\, y_i=c_k)+\lambda}{\sum\limits_{i=1}^{N} I(y_i=c_k)+S_l\lambda}
$$
where $S_l$ represents the number of possible values of the $l$-th feature and $\lambda \geq 0$; taking $\lambda = 1$ is called Laplace smoothing.
Similarly, the Bayesian estimation of the prior probability is changed to
$$
P_\lambda(Y=c_k) = \frac{\sum\limits_{i=1}^{N} I(y_i=c_k)+\lambda}{N+K\lambda}
$$

where $K$ represents the number of possible values of $Y$, that is, the number of classes.
In practice, with $\lambda = 1$ this amounts to initializing the count of every possible value to 1, so that every value is treated as having appeared at least once and no estimated probability is 0.
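The difference between the two estimates is easy to see numerically. A minimal sketch, with invented counts: suppose the word 'stupid' never appears among the 40 training words of class 0, and the class-0 vocabulary has 100 distinct words.

count = 0       # occurrences of 'stupid' in class-0 documents (invented)
total = 40      # total words observed in class 0 (invented)
S = 100         # number of possible values, i.e. vocabulary size (invented)
lam = 1         # lambda = 1: Laplace smoothing

mle = count / total                        # 0.0 -- zeroes out the whole product
bayes = (count + lam) / (total + S * lam)  # 1/140, small but nonzero
print(mle, bayes)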
Text classification
The Naive Bayes classifier gives the most likely class together with its estimated probability, and it is commonly used for text classification.
The core idea of classification is to select the category with the highest probability. The Bayes formula is as follows:

$$
P(c_i|w) = \frac{P(w|c_i)P(c_i)}{P(w)}
$$

where $w$ is the word-count vector of the document and $c_i$ is a category.
Here the number of times each word appears is taken as a feature.
Assuming each feature is independent, that is, each word is independent of the others, then

$$
P(w|c_i) = P(w_0|c_i)\,P(w_1|c_i)\cdots P(w_n|c_i)
$$
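Multiplying many small probabilities underflows floating point, which is why the code below works with logarithms (this is what the np.log calls in trainNB0 and the sums in classifyNB implement). Equivalently:

$$
\log P(w|c_i)P(c_i) = \log P(c_i) + \sum\limits_{l=0}^{n} \log P(w_l|c_i)
$$

Since $P(w)$ is the same for every category, comparing these log scores is enough to pick the most likely class.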
The complete code is as follows:
import numpy as np
import re
import feedparser
import operator

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(data):  # Create the vocabulary list (union of all document words)
    returnList = set([])
    for subdata in data:
        returnList = returnList | set(subdata)
    return list(returnList)

def setofWords2Vec(vocabList, data):  # Convert a document into a word-count vector over the vocabulary
    returnList = [0] * len(vocabList)
    for vocab in data:
        if vocab in vocabList:
            returnList[vocabList.index(vocab)] += 1
    return returnList

def trainNB0(trainMatrix, trainCategory):  # Training: estimate the class probabilities
    pAbusive = sum(trainCategory) / len(trainCategory)
    p1num = np.ones(len(trainMatrix[0]))  # counts start at 1 (Laplace smoothing)
    p0num = np.ones(len(trainMatrix[0]))
    p1Denom = 2
    p0Denom = 2
    for i in range(len(trainCategory)):
        if trainCategory[i] == 1:
            p1num = p1num + trainMatrix[i]
            p1Denom = p1Denom + sum(trainMatrix[i])
        else:
            p0num = p0num + trainMatrix[i]
            p0Denom = p0Denom + sum(trainMatrix[i])
    p1Vect = np.log(p1num / p1Denom)  # log probabilities to avoid underflow
    p0Vect = np.log(p0num / p0Denom)
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):  # Classification: compare log posteriors
    p0 = sum(vec2Classify * p0Vec) + np.log(1 - pClass1)
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def textParse(bigString):  # Split text into lowercase tokens longer than 2 characters
    splitdata = re.split(r'\W+', bigString)
    splitdata = [token.lower() for token in splitdata if len(token) > 2]
    return splitdata

def spamTest():
    docList = []
    classList = []
    for i in range(1, 26):
        with open('spam/%d.txt' % i) as f:
            docList.append(textParse(f.read()))
            classList.append(1)
        with open('ham/%d.txt' % i) as f:
            docList.append(textParse(f.read()))
            classList.append(0)
    vocabList = createVocabList(docList)
    trainList = list(range(50))
    testList = []
    for i in range(13):  # hold out 13 random documents for testing
        num = int(np.random.uniform(0, len(trainList)))
        testList.append(trainList[num])
        del(trainList[num])
    docMatrix = []
    docClass = []
    for i in trainList:
        docMatrix.append(setofWords2Vec(vocabList, docList[i]))
        docClass.append(classList[i])
    p0v, p1v, pAb = trainNB0(docMatrix, docClass)
    errorCount = 0
    for i in testList:
        subVec = setofWords2Vec(vocabList, docList[i])
        if classList[i] != classifyNB(subVec, p0v, p1v, pAb):
            errorCount += 1
    return errorCount / len(testList)

def calcMostFreq(vocabList, fullText):  # Return the 30 most frequent words
    count = {}
    for vocab in vocabList:
        count[vocab] = fullText.count(vocab)
    sortedFreq = sorted(count.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def localWords(feed1, feed0):
    docList = []
    classList = []
    fullText = []
    numList = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(numList):
        doc1 = textParse(feed1['entries'][i]['summary'])
        docList.append(doc1)
        classList.append(1)
        fullText.extend(doc1)
        doc0 = textParse(feed0['entries'][i]['summary'])
        docList.append(doc0)
        classList.append(0)
        fullText.extend(doc0)
    vocabList = createVocabList(docList)
    top30Words = calcMostFreq(vocabList, fullText)
    for word in top30Words:  # remove the most frequent words (acts as stop-word removal)
        if word[0] in vocabList:
            vocabList.remove(word[0])
    trainingSet = list(range(2 * numList))
    testSet = []
    for i in range(20):
        randnum = int(np.random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randnum])
        del(trainingSet[randnum])
    trainMat = []
    trainClass = []
    for i in trainingSet:
        trainClass.append(classList[i])
        trainMat.append(setofWords2Vec(vocabList, docList[i]))
    p0V, p1V, pSpam = trainNB0(trainMat, trainClass)
    errCount = 0
    for i in testSet:
        testData = setofWords2Vec(vocabList, docList[i])
        if classList[i] != classifyNB(testData, p0V, p1V, pSpam):
            errCount += 1
    return errCount / len(testSet)

if __name__ == "__main__":
    ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
    sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
    print(localWords(ny, sf))
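A quick way to sanity-check the classifier is to run it on the small built-in posting list (a usage sketch; the two test sentences are invented, and the expected outputs follow from the training labels above):

listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = [setofWords2Vec(myVocabList, doc) for doc in listOPosts]
p0V, p1V, pAb = trainNB0(trainMat, listClasses)
testDoc = setofWords2Vec(myVocabList, ['love', 'my', 'dalmation'])
print(classifyNB(testDoc, p0V, p1V, pAb))  # expected: 0 (not abusive)
testDoc = setofWords2Vec(myVocabList, ['stupid', 'garbage'])
print(classifyNB(testDoc, p0V, p1V, pAb))  # expected: 1 (abusive)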
Programming Tips:
1. Union of two sets:
vocab = vocab | set(document)
2. Create a vector with all elements set to zero:
vec = [0]*10
Code and dataset download: Bayesian
That's all for this article. I hope it is helpful to your learning, and please continue to support Yelling Tutorial.