
Learning Python Text Feature Extraction and Vectorization Algorithms

This article walks through concrete Python code for text feature extraction and vectorization, for your reference.

Suppose we have just watched Nolan's blockbuster 'Interstellar'. How could we let a machine automatically analyze audience reviews of the movie and decide whether each one is 'praise' (positive) or 'criticism' (negative)?

This type of problem belongs to sentiment analysis. The first step in handling it is to convert the text into features.

Therefore, in this chapter we learn only that first step: how to extract features from text and vectorize them.

Since processing Chinese involves word segmentation, this article uses a simple example to illustrate how to use Python's machine learning library to extract features from English text.

1、Data Preparation

Python's sklearn.datasets supports reading all of the classified text from a directory tree. However, the directories must be arranged with one folder per label name. For example, the dataset used in this article has 2 labels, one named 'neg' and one named 'pos', with 6 text files in each folder (a minimal loading sketch follows the file listings below). The directory layout is as follows:

neg
    1.txt
    2.txt
    ......
pos
    1.txt
    2.txt
    ....

The contents of the 12 files are summarized as follows:

neg: 
  shit. 
  waste my money. 
  waste of money. 
  sb movie. 
  waste of time. 
  a shit movie. 
pos: 
  nb! nb movie! 
  nb! 
  worth my money. 
  I love this movie! 
  a nb movie. 
  worth it! 
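
Once the files are arranged this way, sklearn can load them directly. Here is a minimal sketch; 'endata' is the directory name used in the full code listing later in this article:

# Sketch: load the folder-per-label layout shown above.
from sklearn.datasets import load_files

movie_reviews = load_files('endata')
print(movie_reviews.target_names)  # ['neg', 'pos'] - one label per folder
print(len(movie_reviews.data))     # 12 - one document per text file
print(movie_reviews.target)        # 0/1 label index for each document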

2、Text Features

How to extract sentiment from these English texts and classify them?

The most direct method is to extract words. It is generally believed that many keywords can reflect the speaker's attitude. For example, in the above simple dataset, it is easy to find that those who say 'shit' must belong to the neg class.

Of course, the dataset above is deliberately simple, for ease of explanation. In reality, a single word often carries an ambiguous attitude. Still, there is good reason to believe that the more often a word appears in the neg class, the greater the probability that it expresses a neg attitude.

We also notice that some words are meaningless for sentiment classification, such as 'of', 'I', and so on in the data above. These words have a name: 'stop words'. They can simply be ignored and left out of the counts. Obviously, skipping them reduces the storage needed for the word-frequency records and speeds up processing.

Using the raw frequency of each word as a feature also has a problem. For example, the word 'movie' appears 5 times across the 12 samples, but it appears roughly as often in both classes, so it has little discriminating power. By contrast, a word like 'worth' appears only 2 times, yet only in the pos class; it clearly carries a stronger sentiment color, that is, a higher degree of discrimination.

Therefore, we need to introduce TF-IDF (Term Frequency-Inverse Document Frequency) to weigh each word further.

TF (Term Frequency) is very simple to compute: it is just the count of a term t in a document divided by the total number of words in that document. For example, in the document 'I love this movie', the TF of the word 'love' is 1/4. If we remove the stop words 'I' and 'this', it becomes 1/2.
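
As a quick check of this arithmetic, here is a plain-Python sketch, no library needed:

# Sketch of the TF arithmetic: term count divided by document length,
# before and after removing the stop words 'I' and 'this'.
words = 'I love this movie'.lower().split()
print(words.count('love') / len(words))        # 1/4 = 0.25
filtered = [w for w in words if w not in ('i', 'this')]
print(filtered.count('love') / len(filtered))  # 1/2 = 0.5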

IDF (Inverse Document Frequency) means: for a term t, take the total number of documents D divided by the number of documents Dt in which the term appears, and then take the natural logarithm.
For example, the word 'movie' appears in 5 documents, while the total number of documents is 12, so its IDF is ln(12/5).
Clearly, IDF is designed to highlight words that appear rarely but carry a strong emotional color. For example, the IDF of a common word like 'movie' is ln(12/5) = 0.88, much less than the IDF of 'love', ln(12/1) = 2.48.

TF-IDF is simply the product of the two. In this way, we can compute the TF-IDF of each word in each document, and that value is the text feature we extract.
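
Putting the two together, here is a sketch of the hand calculation (note that sklearn's TfidfVectorizer uses a slightly different smoothed formula, as discussed later):

# Sketch: TF-IDF by hand for 'movie' and 'love' in a 4-word document,
# using IDF = ln(D / Dt) over the 12 documents from the dataset above.
import math

D = 12                         # total number of documents
tf = 1 / 4                     # each word appears once in a 4-word document
print(tf * math.log(D / 5))    # 'movie' is in 5 documents: ~0.22
print(tf * math.log(D / 1))    # 'love' is in 1 document:   ~0.62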

3、Vectorization

With the above foundation, we can vectorize the document. Let's first look at the code and then analyze the significance of vectorization:

# -*- coding: utf-8 -*- 
import scipy as sp 
import numpy as np 
from sklearn.datasets import load_files 
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import TfidfVectorizer 
'''Load the dataset and split it into training and test sets (test_size = 0.3: 70% train, 30% test)''' 
movie_reviews = load_files('endata')  
doc_terms_train, doc_terms_test, y_train, y_test\ 
  = train_test_split(movie_reviews.data, movie_reviews.target, test_size = 0.3) 
'''Vector space model on TF-IDF features; note that the test samples call the transform interface''' 
count_vec = TfidfVectorizer(binary = False, decode_error = 'ignore',
              stop_words = 'english') 
x_train = count_vec.fit_transform(doc_terms_train) 
x_test = count_vec.transform(doc_terms_test) 
x    = count_vec.transform(movie_reviews.data) 
y    = movie_reviews.target 
print(doc_terms_train) 
print(count_vec.get_feature_names()) 
print(x_train.toarray()) 
print(movie_reviews.target) 

The running result is as follows:
["waste of time.", "a shit movie.", "a nb movie.", "I love this movie!", "shit.", "worth my money.", "sb movie.", "worth it!"]
['love', 'money', 'movie', 'nb', 'sb', 'shit', 'time', 'waste', 'worth']
[[ 0.          0.          0.          0.          0.          0.   0.70710678  0.70710678  0.        ]
 [ 0.          0.          0.60335753  0.          0.          0.79747081   0.          0.          0.        ]
 [ 0.          0.          0.53550237  0.84453372  0.          0.          0.   0.          0.        ]
 [ 0.84453372  0.          0.53550237  0.          0.          0.          0.   0.          0.        ]
 [ 0.          0.          0.          0.          0.          1.          0.   0.          0.        ]
 [ 0.          0.76642984  0.          0.          0.          0.          0.   0.          0.64232803]
 [ 0.          0.          0.53550237  0.          0.84453372  0.          0.   0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.   0.          1.        ]]
[1 1 0 1 0 1 0 1 1 0 0 0]

Python's raw output is hard to read, so here is the same matrix laid out as a table (rows are the 8 training samples, columns the 9 features, values rounded to 3 decimals):

document             love   money  movie  nb     sb     shit   time   waste  worth
waste of time.       0      0      0      0      0      0      0.707  0.707  0
a shit movie.        0      0      0.603  0      0      0.797  0      0      0
a nb movie.          0      0      0.536  0.845  0      0      0      0      0
I love this movie!   0.845  0      0.536  0      0      0      0      0      0
shit.                0      0      0      0      0      1      0      0      0
worth my money.      0      0.766  0      0      0      0      0      0      0.642
sb movie.            0      0      0.536  0      0.845  0      0      0      0
worth it!            0      0      0      0      0      0      0      0      1

From the table above, we can observe the following points:

1、Filtering stop words.

When initializing count_vec, we passed stop_words = 'english' to the constructor, which selects the built-in English stop word list. You can call count_vec.get_stop_words() to view all of the stop words built into TfidfVectorizer. Of course, you can also pass in your own stop word list (for example, one that also filters 'movie' here), as sketched below.
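
A small sketch of that option; adding 'movie' is purely illustrative, since we saw it carries little sentiment in this dataset:

# Sketch: passing a custom stop-word list instead of the built-in one.
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

custom_stop_words = list(ENGLISH_STOP_WORDS) + ['movie']
count_vec = TfidfVectorizer(stop_words = custom_stop_words, decode_error = 'ignore')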

2、TF-IDF is computed on top of the basic word-frequency statistics.

The word-frequency calculation here uses sklearn's TfidfVectorizer. This class inherits from CountVectorizer and adds TF-IDF weighting on top of the raw counts.
You may find that the values here differ from the ones we computed by hand earlier. That is because TfidfVectorizer uses a smoothed IDF formula and, by default, normalizes each document vector to unit length (norm = 'l2'), which constrains every value to the range [0, 1].
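
As a quick check (a sketch reusing the x_train computed above), each row of the TF-IDF matrix should have unit Euclidean length:

# Sketch: verify the default L2 normalization - every non-empty
# document vector in x_train has Euclidean length 1.
import numpy as np

row_norms = np.sqrt((x_train.toarray() ** 2).sum(axis=1))
print(row_norms)  # prints 1.0 for each of the 8 training documents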

3、The result of count_vec.fit_transform is a huge matrix. We can see that there are many 0s in the table above, which is why sklearn uses a sparse matrix internally. The example data here is small. If you are interested, try the real data used by machine learning researchers, from Cornell University: http://www.cs.cornell.edu/people/pabo/movie-review-data/. That site provides several datasets, including one with around 700 positive and 700 negative reviews. At that scale, vectorization still finishes within about 1 minute, and it is worth trying. Note, however, that those datasets may contain characters that cannot be decoded, which is why decode_error = 'ignore' was passed to count_vec to skip such characters.
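
To see the sparse representation for yourself, here is a small sketch against the x_train from the listing above:

# The vectorizer returns a SciPy sparse matrix rather than a dense array;
# only the non-zero entries are stored.
print(type(x_train))   # a scipy.sparse CSR matrix
print(x_train.shape)   # (number of documents, vocabulary size)
print(x_train.nnz)     # count of stored non-zero entries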

The table above shows the result for the 8 training samples over the 9 features. This result can then be fed to various classification algorithms; a sketch follows.
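
The article does not prescribe a particular algorithm; MultinomialNB is just one common baseline for word features, used here as a sketch:

# Sketch: train one possible classifier on the TF-IDF features above.
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(x_train, y_train)
print(clf.predict(x_test))        # predicted labels for the test documents
print(clf.score(x_test, y_test))  # accuracy on the 30% held-out split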

That's all for this article. I hope it is helpful for your learning.

