Introduction
With the growth of the World Wide Web and the arrival of the big data era, enormous amounts of digital information are produced, stored, transmitted, and transformed every day. Finding the information that meets one's needs within this mass of data, organizing it, and making good use of it has become a major challenge. Full-text retrieval is the most common information query application today: the core principle realized in this article is the same one used by everyday search engines and by the search functions of blogs and forums. As document information is digitized, storing it effectively and extracting it promptly and accurately is something every company, enterprise, and organization needs to do well. Mature theories and methods exist for full-text retrieval in English. The open-source full-text retrieval engine Lucene is a sub-project of the Apache Software Foundation's Jakarta project; its purpose is to provide software developers with a simple, easy-to-use toolkit for implementing full-text retrieval in a target system. Lucene does not support Chinese out of the box, but many open-source Chinese segmenters are now available that allow Chinese content to be indexed. Building on a study of Lucene's core principles, this article implements the crawling and retrieval of both Chinese and English web pages.
1 Introduction to Lucene
1.1 Lucene introduction
Lucene is a full-text search engine toolkit written in Java. It implements the two core functions of index construction and search, and the two are independent of each other, which makes it easy for developers to extend. Lucene provides a rich API for interacting conveniently with information stored in the index. Note that Lucene is not a complete full-text search application; it provides indexing and search functions to applications. That is, for Lucene to be truly useful, some secondary development must be done on top of it.
The structural design of Lucene is rather similar to that of a database, but Lucene's index differs greatly from a database index. Both build indexes for the convenience of searching; a database, however, establishes indexes only on selected fields and must convert the data into formatted records before saving it, whereas full-text search indexes all of the information in a certain way. The differences and similarities between the two kinds of retrieval are shown in Table 1-1.
Table 1-1: Comparison between database retrieval and Lucene retrieval

| Comparison item | Lucene retrieval | Database retrieval |
|---|---|---|
| Data retrieval | Retrieves from Lucene's index files | Retrieves records via the database index |
| Index structure | Document | Record |
| Query results | Hits: the set of documents satisfying the query | Result set: the records containing the keywords |
| Full-text search | Supported | Not supported |
| Fuzzy query | Supported | Not supported |
| Result sorting | Weights can be set and results sorted by relevance | Cannot be sorted by relevance |
1.2 Lucene overall structure
Lucene is released as a JAR file, with relatively fast version updates and large gaps between versions; this article uses version 5.3.1. The main sub-packages used are listed in Table 1-2.
Table 1-2: Sub-packages and their functions

| Package name | Function |
|---|---|
| org.apache.lucene.analysis | Tokenization |
| org.apache.lucene.document | Documents for index management |
| org.apache.lucene.index | Index operations, including addition, deletion, etc. |
| org.apache.lucene.queryparser | Query builder; constructs retrieval expressions |
| org.apache.lucene.search | Retrieval management |
| org.apache.lucene.store | Data storage management |
| org.apache.lucene.util | Utility classes |
1.3 Lucene architectural design
Lucene is very powerful, but fundamentally it comprises two parts: indexing text content after tokenization, and returning results according to query conditions; in other words, index establishment and querying.
As shown in Figure 1-1, this article focuses on the external interfaces and information sources, and emphasizes the indexing and querying of text content obtained by web crawling.
Figure 1-1: Lucene's architectural design
2 JDK installation and environment variable configuration
1. JDK download:
Download the installer that matches your system version from the Oracle official website (link below), run it, follow the prompts, and confirm the JRE installation when asked.
http://www.oracle.com/technetwork/java/javase/downloads/index.html
2. Set environment variables:
(1) Right-click Computer => Properties => Advanced System Settings => Environment Variables => System Variables => New => JAVA_HOME: the JDK installation path
(2) Append %JAVA_HOME%\bin to the Path variable
3. Test whether it is successful:
Start => Run => cmd, then in the DOS window that pops up:
Enter java -version: the version information should be displayed.
Enter javac: usage information for javac should appear.
If the output is as shown in Figure 2-1, the configuration is successful.
Figure 2-1: Testing the Java configuration in the cmd window
3 Write Java code to obtain web content
Because Lucene uses different tokenizers for different languages, the standard tokenizer is used for English and the smartcn tokenizer is chosen for Chinese. When obtaining a page, it is first fetched and saved as an HTML file. Since the tags in HTML would interfere with the search results, they must be removed and the text content saved as a txt file. Apart from the tokenizer, the handling of Chinese and English is basically the same, so the subsequent code and experimental demonstrations use one or the other. This article takes the web pages of fifty Chinese and English stories as its examples.
The code design is as follows: Url2Html.java saves the page at the input URL as an HTML file, and Html2Txt.java removes the HTML tags and saves the result as a txt document. The specific code is shown below as Figures 3-1 and 3-2.
public void way(String filePath, String url) throws Exception {
    File dest = new File(filePath);                           // Create the output file
    FileOutputStream fos = new FileOutputStream(dest);        // Byte output stream
    URL wangzhi = new URL(url);                               // Set the URL
    InputStream is = wangzhi.openStream();                    // Byte input stream
    BufferedInputStream bis = new BufferedInputStream(is);    // Buffer the input stream
    BufferedOutputStream bos = new BufferedOutputStream(fos); // Buffer the output stream
    // Read and copy bytes
    int length;
    byte[] bytes = new byte[1024 * 20];
    while ((length = bis.read(bytes, 0, bytes.length)) != -1) {
        bos.write(bytes, 0, length);
    }
    // Close the buffered streams and the input/output streams
    bos.close();
    fos.close();
    bis.close();
    is.close();
}
public String getBody(String val) {
    String zyf = val.replaceAll("</?[^>]+>", ""); // Strip all HTML tags, e.g. <html> and </html>
    return zyf;
}

public void writeTxt(String str, String writePath) {
    File writename = new File(writePath);
    try {
        writename.createNewFile();
        BufferedWriter out = new BufferedWriter(new FileWriter(writename));
        out.write(str);
        out.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Taking the web page of the fairy tale "Dumb Wolf Going to School" as an example, the document paths are set to "E:\work\lucene\test\data\html" and "E:\work\lucene\test\data\txt", and two parameters must be set for each page read: the file name (filename) and the target URL (url). A new main function calls the two methods; the specific implementation is shown in Figure 3-3:
public static void main(String[] args) {
    String filename = "jingdizhi"; // File name
    String url = "http://www.51test.net/show/8072125.html"; // URL of the page to crawl
    String filePath = "E:\\work\\lucene\\test\\data\\html\\" + filename + ".html"; // html output path
    String writePath = "E:\\work\\lucene\\test\\data\\txt\\" + filename + ".txt";  // txt output path
    Url2Html url2html = new Url2Html();
    try {
        url2html.way(filePath, url);
    } catch (Exception e) {
        e.printStackTrace();
    }
    Html2Txt html2txt = new Html2Txt();
    String read = html2txt.readfile(filePath); // Read the HTML file
    String txt = html2txt.getBody(read);       // Remove the HTML tags
    System.out.println(txt);
    try {
        html2txt.writeTxt(txt, writePath);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
After the program runs, 'bainiao xue shang.html' and 'bainiao xue shang.txt' are created in the two folders respectively.
4 Establishing index
The basic principles of indexing and querying are as follows:
Establishing the index: a search engine's index is in essence a concrete data structure implementing the 'term-document matrix', and building it is the first step of full-text retrieval. Lucene provides the IndexWriter class for index management, mainly through its addDocument(), deleteDocuments(), and updateDocument() methods. Weights can also be set: by assigning different index weights, results can be returned by relevance during search.
Performing a search: without an index, retrieval is a sequential scan of the documents; once an index is built, the positions of a term in the documents can be found by looking up the index, and the matching positions and terms returned. Lucene provides the IndexSearcher class for document retrieval. Retrieval takes two main forms: the first is Term, for single-term retrieval; the second is Parser, which allows retrieval expressions to be constructed flexibly in many forms. The specific methods are demonstrated later.
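The 'term-document matrix' idea above can be illustrated with a minimal inverted index in plain Java. This is a sketch of the principle only, not Lucene's internal implementation; the class name and whitespace tokenization are illustrative assumptions:

```java
import java.util.*;

// Minimal inverted index: maps each term to the sorted set of document ids containing it.
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Tokenize on whitespace and record a term -> docId posting for each token.
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up the posting list for a single term (the idea behind a TermQuery).
    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "the wolf goes to school");
        idx.addDocument(2, "the wolf and the fox");
        System.out.println(idx.search("wolf")); // prints [1, 2]
    }
}
```

Instead of scanning every document for a term, the lookup is a single map access; this is why index construction is the first step of full-text retrieval.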
4.1 Experiment environment
The PC used runs Windows 10 x64 with 8 GB of memory and a 256 GB solid-state drive. The development environment is MyEclipse 10, and the JDK version is 1.8. Because of some syntax changes, several classes were implemented against version 1.6 during the experiment.
4.2 Establishing index
Establishing the index library means adding index records to it one after another; Lucene provides an interface for adding a single index record.
Three classes are mainly used: the 'write indexer' (IndexWriter), Document, and Field. To establish an index, a Document object must first be constructed and its fields determined, similar to defining a table structure in a relational database: a Document corresponds to a row in a table, and its fields to the columns of that row. In Lucene, different indexing/storage rules can be chosen per field depending on its properties and output requirements. In this experiment, the file name (fileName), file path (fullPath), and text content (contents) are treated as the fields of the Document.
IndexWriter is responsible for receiving new documents and writing them into the index library. When creating the IndexWriter, the language analyzer to be used must be specified. Index establishment falls into two categories: unweighted and weighted.
public Indexer(String indexDir) throws Exception {
    Directory dir = FSDirectory.open(Paths.get(indexDir));
    Analyzer analyzer = new StandardAnalyzer(); // Standard tokenizer (English)
    // Analyzer analyzer = new SmartChineseAnalyzer(); // smartcn tokenizer (Chinese)
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    writer = new IndexWriter(dir, iwc);
}
Set the index fields; Store indicates whether the field content is stored in the index: fileName and fullPath occupy little memory and can be stored so they are easy to return in query results.
private Document getDocument(File f) throws Exception {
    Document doc = new Document();
    doc.add(new TextField("contents", new FileReader(f)));               // Text content (not stored)
    doc.add(new TextField("fileName", f.getName(), Store.YES));          // File name
    doc.add(new TextField("fullPath", f.getCanonicalPath(), Store.YES)); // File path
    return doc;
}
After running the main code, the result is as shown in the figure: as each file is indexed, the program prints "Index File: " followed by the file path.
4.3 Deletion and modification of the index
Operations on databases generally include CRUD (Create, Read, Update, Delete); here, creation refers to selecting and establishing the index items. Querying, as the core function, is discussed later. This section mainly records the methods used to delete and update indexes.
Deletion comes in two kinds: ordinary deletion and complete deletion. Deleting an index affects the whole database, and for a large system deleting the index means changing the underlying storage, which is time-consuming, labor-intensive, and cannot be undone. As mentioned earlier, creating an index generates several small files, which are merged and searched together at query time. Ordinary deletion simply marks the previously established index entry so it can no longer be matched and returned, whereas complete deletion destroys the index entry irreversibly. Take the index entry with id 1 as an example:
Ordinary deletion (marked before merging):
writer.deleteDocuments(new Term("id", "1"));
writer.commit();
Complete deletion (removed by merging):
writer.deleteDocuments(new Term("id", "1"));
writer.forceMergeDeletes(); // Force physical removal of deleted documents
writer.commit();
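The difference between the two deletion modes can be sketched with a toy model in plain Java. This is an illustration of the idea only, not Lucene's actual segment format: ordinary deletion merely sets a 'deleted' flag that searches skip over, while the force-merge step physically rewrites the data without the flagged entries.

```java
import java.util.*;

// Toy model of the two deletion modes (illustration only, not Lucene's real storage).
public class DeletionSketch {
    static class Doc {
        final String id;
        boolean deleted; // tombstone flag set by ordinary deletion
        Doc(String id) { this.id = id; }
    }

    private final List<Doc> docs = new ArrayList<>();

    public void add(String id) { docs.add(new Doc(id)); }

    // Ordinary deletion: mark only; the entry still occupies space on disk.
    public void delete(String id) {
        for (Doc d : docs) if (d.id.equals(id)) d.deleted = true;
    }

    // Complete deletion (cf. forceMergeDeletes): rewrite without tombstoned entries.
    public void forceMergeDeletes() {
        docs.removeIf(d -> d.deleted);
    }

    // Searches always skip tombstoned entries, so both modes look the same to queries.
    public int liveCount() {
        return (int) docs.stream().filter(d -> !d.deleted).count();
    }

    public int storedCount() { return docs.size(); }
}
```

This also shows why ordinary deletion is cheap and reversible until a merge happens, while complete deletion reclaims space but cannot be undone.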
The principle of modifying an index is relatively simple: it overwrites the existing entry, using the same kind of code as the index addition shown above. No further elaboration is given here.
4.4 Index weighting
Lucene sorts by relevance by default, and provides a boost parameter that can be set on a Field to indicate the importance of the record. Among records meeting the search conditions, those with higher importance are returned earlier; when there are many records, low-weight records fall beyond the first page. The weighting of the index is therefore an important factor in how satisfactory the returned results are. When designing real information systems, rigorous weight formulas should be established so that Field weights can be adjusted to better meet user needs.
For example, search engines give higher weights to pages with high click-through rates and many inbound and outbound links, ranking them on the first page. The implementation code is shown in Figure 4-1, and the comparison of unweighted and weighted results in Figure 4-2.
TextField field = new TextField("fullPath", f.getCanonicalPath(), Store.YES);
if ("A GREAT GRIEF.txt".equals(f.getName())) {
    field.setBoost(2.0f); // Raise this file's boost above the default 1.0f so it ranks earlier
}
doc.add(field);
Figure 4-1: Index weighting
Figure 4-2: Before weighting
Figure 4-3: After weighting
As Figure 4-2 shows, without weighting the returned results are sorted in dictionary order, so 'first' comes before 'secondry'. After boosting the file path of 'secondry story.txt', the order of the returned results changes, demonstrating the effect of the weight.
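The effect of boosting on ranking can be sketched with a deliberately simplified scoring model. This is not Lucene's actual TF-IDF/BM25 formula; here the score is just term frequency multiplied by the document's boost, which is enough to show why a boosted document overtakes an otherwise equal one:

```java
import java.util.*;

// Simplified ranking sketch: score = term frequency * boost (illustration only).
public class BoostSketch {
    static double score(String text, String term, double boost) {
        int tf = 0;
        for (String w : text.toLowerCase().split("\\s+")) {
            if (w.equals(term)) tf++; // count occurrences of the query term
        }
        return tf * boost;
    }

    // Returns document names ordered by descending score for the given term.
    static List<String> rank(Map<String, String> docs, Map<String, Double> boosts, String term) {
        List<String> names = new ArrayList<>(docs.keySet());
        names.sort(Comparator.comparingDouble(
                (String n) -> score(docs.get(n), term, boosts.getOrDefault(n, 1.0))).reversed());
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("first.txt", "a story");
        docs.put("second.txt", "a story");
        Map<String, Double> boosts = new HashMap<>();
        boosts.put("second.txt", 2.0); // boost one document above the default 1.0
        System.out.println(BoostSketch.rank(docs, boosts, "story")); // prints [second.txt, first.txt]
    }
}
```

With equal content, the boosted document scores 2.0 against 1.0 and is returned first, mirroring the before/after comparison above.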
5 Perform a query
Lucene's search interface is mainly composed of three classes: QueryParser, IndexSearcher, and Hits. QueryParser is the query parser, responsible for parsing the query keywords submitted by the user; when creating a parser, the field to parse and the language analyzer must be specified, and the analyzer must be the same one used when the index library was built, otherwise the query results will be wrong. IndexSearcher is the index search engine; when instantiating it, the directory of the index library must be specified. Its search method performs the retrieval, accepting a Query as its parameter and returning Hits. Hits is a sorted collection of query results whose elements are Documents; information about a document, such as its file name, file path, and content, can be obtained through Document's get method.
5.1 Basic query
There are mainly two ways to construct a query. The first, building a QueryParser expression, is recommended because it supports flexible combinations including boolean logic expressions and fuzzy matching; the second, Term, can only be used for single-term queries.
1. Construct QueryParser query expression:
QueryParser parser = new QueryParser("fullPath", analyzer);
Query query = parser.parse(q);
2. Query for specific items:
Term t = new Term("fileName", q);
Query query = new TermQuery(t);
The query results are shown in Figure 5-1, taking a query for file names (fileName) containing "large" as an example.
Figure 5-1: "large" query results
5.2 Fuzzy query
When constructing the QueryParser query, exact matching and fuzzy matching can be switched by modifying the query term q: fuzzy matching is requested by appending "~" to q. See Figure 5-2:
Figure 5-2: Fuzzy matching
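Behind the "~" operator, fuzzy matching accepts terms within a bounded edit distance of the query term (Lucene's FuzzyQuery defaults to at most 2 edits). The idea can be sketched with a plain Levenshtein distance check; the class and method names here are illustrative, not Lucene APIs:

```java
// Sketch of the edit-distance matching behind fuzzy queries ("term~"):
// a candidate term matches if its Levenshtein distance to the query term
// is within the allowed number of edits (FuzzyQuery's default maximum is 2).
public class FuzzyMatchSketch {
    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    static boolean fuzzyMatches(String query, String term, int maxEdits) {
        return levenshtein(query, term) <= maxEdits;
    }
}
```

So a query like "large~" would also match "larg" or "larje", but not an unrelated term like "small".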
5.3 Conditional query
Boolean logic queries and fuzzy queries only require changing the query term q, whereas conditional (range-restricted) queries require constructing the query expression itself. They mainly fall into the following categories:
Term-range search, numeric-range search, prefix (string start) search, and multi-condition queries are listed below. The boolean parameters indicate whether the lower and upper bounds are inclusive.
Specified item range:
TermRangeQuery query = new TermRangeQuery("desc", new BytesRef("b".getBytes()), new BytesRef("c".getBytes()), true, true);
Specify numeric range:
NumericRangeQuery<Integer> query = NumericRangeQuery.newIntRange("id", 1, 2, true, true);
Specify string starting with:
PrefixQuery query = new PrefixQuery(new Term("city", "a"));
Multi-condition query:
NumericRangeQuery<Integer> query1 = NumericRangeQuery.newIntRange("id", 1, 2, true, true);
PrefixQuery query2 = new PrefixQuery(new Term("city", "a"));
BooleanQuery.Builder booleanQuery = new BooleanQuery.Builder();
booleanQuery.add(query1, BooleanClause.Occur.MUST);
booleanQuery.add(query2, BooleanClause.Occur.MUST);
5.4 Highlighting search
When searching on Baidu, Google, and other engines, the returned pages display the search keywords in red and show a summary, i.e. an extract of the content containing the keywords. Highlighting means changing the style of the keywords. In this experiment, run inside MyEclipse, the returned results show no visual style change; the keywords in the returned content are merely wrapped in HTML tags, which would produce the style change when displayed in a web page.
The highlighting code is shown in Figure 5-3 and the result in Figure 5-4: the matched keyword ("Nanjing") is wrapped in <b> and <font> tags, and is displayed bold and red on a web page.
QueryScorer scorer = new QueryScorer(query);
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<b><font color='red'>", "</font></b>");
Highlighter highlighter = new Highlighter(simpleHTMLFormatter, scorer);
highlighter.setTextFragmenter(fragmenter);
Figure 5-3: Highlighting settings
Figure 5-4: Highlighted results
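The tag-wrapping step itself can be illustrated without Lucene using a plain regex replacement. This shows only the formatting idea; Lucene's Highlighter additionally scores fragments and selects the best ones, and the class name here is an illustrative assumption:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the highlighting idea: wrap every occurrence of the keyword in
// the same tags used with SimpleHTMLFormatter, so a browser renders it bold and red.
public class HighlightSketch {
    static String highlight(String text, String keyword) {
        String pre = "<b><font color='red'>";
        String post = "</font></b>";
        // Pattern.quote / Matcher.quoteReplacement guard against regex metacharacters
        // in the keyword and replacement strings.
        return text.replaceAll(Pattern.quote(keyword),
                Matcher.quoteReplacement(pre + keyword + post));
    }
}
```

For example, highlight("Nanjing is a city", "Nanjing") yields "<b><font color='red'>Nanjing</font></b> is a city", which a browser renders with the keyword bold and red.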
6 Problems and shortcomings encountered in the experiment
Lucene versions update quickly, and the JDK version, Eclipse version, and Lucene version must fit together well, otherwise many incompatibilities arise. Debugging also involved choosing between JDK 1.6 and 1.8: for example, an append method used in the web-crawling code was removed in 1.8 and could no longer be used, while reading the document path with FSDirectory.open() is only supported under JDK 1.8.
The shortcomings of this experiment are mainly reflected in:
The flexibility of the code is low: crawling pages requires manual steps, and Chinese and English must be processed separately. It should be improved so that the page's language is detected and the appropriate segmenter selected automatically.
The reusability of the code is low, lacking a reasonable organization of classes and methods. For simplicity, only basic comments and markers were added to the core code to achieve the effect; this needs improvement.
The portability of the code is low: the web crawling uses JDK 1.6 and the Lucene part uses JDK 1.8, so moving the project to another machine requires some modification and configuration rather than working with one click.
7 Summary
This article starts from Lucene's principles, works through the ideas and methods of full-text search, and tests the commonly used functions. Through the experiments I came to understand how search engines work and gained practical experience building on the information retrieval course. Lucene is an excellent open-source full-text search framework; through in-depth study I became more familiar with its implementation mechanism, and along the way learned many object-oriented programming methods and ideas. Its good system framework and scalability are worth learning from.