Detailed Explanation of Zero-Width Assertions in Regular Expressions

Regular Expression Zero-Width Assertions:

Zero-width assertions are one of the harder parts of regular expressions, so this chapter analyzes them in terms of how the engine actually matches. They also go by other names, such as "lookaround" or "pre-search", but the naming is not our focus here.

I. Basic concept:

A zero-width assertion, as the name suggests, is a match of zero width. Whatever it checks is not saved in the match result; what it matches is only a position.
Its function is to attach a condition to a given position: the characters before or after that position must satisfy the condition for the subexpressions of the regular expression to match successfully.
Note: "subexpression" here is not limited to expressions enclosed in parentheses; it refers to any matching unit in the regular expression.
JavaScript originally supported only zero-width lookahead, which is divided into positive lookahead and negative lookahead. Lookbehind was added later, in ECMAScript 2018. The JavaScript examples in parts I and II use lookahead only.

Code examples are as follows:

Example Code One:

var str="abZW863";
var reg=/ab(?=[A-Z])/;
console.log(str.match(reg));

The regular expression above means: match the string "ab" when it is followed by an uppercase letter. The final match result is "ab", because the zero-width assertion "(?=[A-Z])" does not consume any character; it only requires that the character after the current position be an uppercase letter.

Example Code Two:

var str="abZW863";
var reg=/ab(?![A-Z])/;
console.log(str.match(reg));

The regular expression above means: match the string "ab" when it is not followed by an uppercase letter. Here the match fails and match() returns null, because the only "ab" in the string is followed by the uppercase letter "Z".
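
The two examples can be put side by side. The sketch below is only an illustration (the variable names are assumptions, not part of the examples above). It also adds a third pattern to show that the uppercase letter checked by the lookahead is not consumed and can still be matched by the token that follows.

```javascript
const str = "abZW863";

// Positive lookahead: "ab" must be followed by an uppercase letter.
const positive = str.match(/ab(?=[A-Z])/);
console.log(positive[0], positive.index); // ab 0

// Negative lookahead: "ab" must NOT be followed by an uppercase letter.
const negative = str.match(/ab(?![A-Z])/);
console.log(negative); // null

// The lookahead consumed nothing, so "Z" is still available to the next token.
const followed = str.match(/ab(?=[A-Z])[A-Z]/);
console.log(followed[0]); // abZ
```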

II. Matching principles:

The code above introduces zero-width assertions only at the conceptual level.
Below, positive and negative zero-width assertions are explained in terms of how the engine matches them.
1. Positive zero-width assertion:
Code examples are as follows:

var str="<div>antzone";
var reg=/^(?=<)<[^>]+>\w+/;
console.log(str.match(reg));

The matching process is as follows:
First, "^" in the regular expression takes control and tries to match at position 0. It matches the start of the string, so it succeeds, and control passes to "(?=<)". Because "^" is zero-width, "(?=<)" also starts at position 0. It requires the character to the right of that position to be "<", and the character to the right of position 0 is indeed "<", so it succeeds and control passes to "<". Because "(?=<)" is also zero-width, "<" starts matching at position 0 as well and succeeds. The rest of the matching process is ordinary and is not described here.
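
The walkthrough above can be checked with a minimal sketch. The variable names are illustrative assumptions. Running "^(?=<)" on its own yields an empty string at index 0, which suggests that the assertion only tests a position and leaves the "<" for the next token.

```javascript
const str = "<div>antzone";

// The full pattern from the example: the lookahead and "<" both start at position 0.
const full = /^(?=<)<[^>]+>\w+/.exec(str);
console.log(full[0]); // <div>antzone

// "^" and "(?=<)" alone: the match succeeds but has zero length.
const positionOnly = /^(?=<)/.exec(str);
console.log(positionOnly[0].length, positionOnly.index); // 0 0

// If the first character is not "<", the assertion fails at position 0,
// and "^" leaves no other position to try.
const noMatch = /^(?=<)<[^>]+>\w+/.exec("div>antzone");
console.log(noMatch); // null
```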

2. Negative zero-width assertion:

Code examples are as follows:

var str="abZW863ab88"; 
var reg=/ab(?![A-Z])/g; 
console.log(str.match(reg));

The matching process is as follows:
First, the character "a" of the regular expression takes control and tries to match at position 0. It matches the character "a", and control passes to "b", which starts at position 1 and matches the character "b". Control then passes to "(?![A-Z])", which starts at position 2. It requires that the character to the right of this position not be an uppercase letter, but that character is the uppercase letter "Z", so the assertion fails. Control returns to "a", which tries again from position 1 and fails, then from position 2 and fails, and so on, until the attempt starting at position 7 succeeds. Control passes to "b", which matches at position 8, and then to "(?![A-Z])", which starts at position 9. The character to the right of position 9 is "8", not an uppercase letter, so the assertion succeeds without consuming any character. The match result is therefore "ab". Because of the g flag, the engine then continues searching from position 9, but finds no further match.
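
To see where the successful attempt starts, the sketch below collects each match together with its index. The loop and the names found and m are assumptions added for illustration.

```javascript
const str = "abZW863ab88";
const reg = /ab(?![A-Z])/g;

// Collect every match together with the position where it starts.
const found = [];
let m;
while ((m = reg.exec(str)) !== null) {
  found.push({ text: m[0], index: m.index });
}
console.log(found); // [ { text: 'ab', index: 7 } ]
```

The "ab" at position 0 is skipped because it is followed by "Z"; only the "ab" starting at position 7 is reported.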

Supplement one:

Definition and explanation

A zero-width assertion is a construct in regular expressions.
In computer science, a regular expression is a single string that describes or matches a set of strings conforming to a certain syntactic rule. Many text editors and other tools use regular expressions to search for and/or replace text that matches a pattern, and many programming languages support string operations based on them. Perl, for example, has a powerful regular expression engine built in. The concept was first popularized by Unix tools such as sed and grep. "Regular expression" is usually abbreviated as 'regex' or 'regexp' in the singular, and 'regexps', 'regexes' or 'regexen' in the plural.

Zero-width assertion

These constructs are used to find something before or after certain content without including that content. Like '\b', '^' and '$', they specify a position that must satisfy a certain condition (that is, an assertion), which is why they are called zero-width assertions. An assertion declares a fact that should be true, and the regular expression continues to match only when the assertion holds. This is best illustrated with examples.

(?=exp), also called a zero-width positive lookahead assertion, asserts that the text after the position where it appears can match the expression 'exp'. For example, '\b\w+(?=ing\b)' matches the front part of a word ending with 'ing' (excluding 'ing' itself); when searching 'I am reading.', it matches 'read'.

using System;
using System.Text.RegularExpressions;
var reg = new Regex(@"\w+(?=ing)");
var str = "muing";
Console.WriteLine(reg.Match(str).Value); // Returns mu

(?<=exp), also called a zero-width positive lookbehind assertion, asserts that the text before the position where it appears can match the expression 'exp'. For example, '(?<=\bre)\w+\b' matches the latter part of a word starting with 're' (excluding 're' itself); when searching 'reading a book.', it matches 'ading'.
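
As a rough check of positive lookahead and lookbehind, the sketch below runs one pattern of each kind in JavaScript: \b\w+(?=ing\b) for the part of a word before a trailing 'ing', and (?<=\bre)\w+\b for the part after a leading 're'. It assumes an engine with ECMAScript 2018 lookbehind support; the variable names are illustrative.

```javascript
// Lookahead: the part of a word before a trailing "ing".
const beforeIng = "I am reading.".match(/\b\w+(?=ing\b)/g);
console.log(beforeIng); // [ 'read' ]

// Lookbehind: the part of a word after a leading "re".
const afterRe = "reading a book.".match(/(?<=\bre)\w+\b/g);
console.log(afterRe); // [ 'ading' ]
```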

Suppose you want to add a comma every three digits in a very long number (counting from the right, of course). You can find the part that needs commas in front of and inside it with ((?<=\d)\d{3})+\b; searching 1234567890 with it gives 234567890.
The following example uses both kinds of assertion: (?<=\s)\d+(?=\s) matches numbers delimited by whitespace (once again, not including the whitespace itself).
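
A minimal sketch of these two patterns in JavaScript, assuming ECMAScript 2018 lookbehind support. The replace step that actually inserts the commas is not from the text above; it is an assumed follow-up that uses a lookbehind and a lookahead together to pick the insertion positions.

```javascript
const digits = "1234567890";

// The part of the number that needs commas before and inside it.
const part = digits.match(/((?<=\d)\d{3})+\b/);
console.log(part[0]); // 234567890

// Assumed follow-up step: insert a comma at every position that has a digit
// before it and a multiple of three digits after it, up to the end of the number.
const withCommas = digits.replace(/(?<=\d)(?=(\d{3})+\b)/g, ",");
console.log(withCommas); // 1,234,567,890

// Both assertions together: numbers delimited by whitespace on both sides.
const spaced = "a 12 b 345 67c".match(/(?<=\s)\d+(?=\s)/g);
console.log(spaced); // [ '12', '345' ]
```

Every match in the replace call is empty, so only commas are added and no digit is removed.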

Negative zero-width assertion

Earlier we mentioned how to match a character that is not a certain character, or not in a certain character class (negation). But what if we only want to make sure a certain character does not appear, without matching anything in its place? For example, suppose we want to find words that contain the letter 'q' not followed by the letter 'u'. We might try this:

'\b\w*q[^u]\w*\b' matches a word containing a 'q' followed by a character other than 'u'. However, if you test it more (or are sharp enough to spot it directly), you will find that the expression goes wrong when 'q' is at the end of a word, as in 'Iraq' or 'Benq'. This is because '[^u]' always consumes a character: if 'q' is the last character of a word, '[^u]' matches the word delimiter after it (a space, a period, or something else), and the following '\w*\b' then matches the next word, so '\b\w*q[^u]\w*\b' can match the whole of 'Iraq fighting'. Negative zero-width assertions solve this kind of problem because they match only a position and consume no characters. We can now write: '\b\w*q(?!u)\w*\b'.
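
The difference between the character class and the negative lookahead can be seen in a short sketch. The test strings come from the paragraph above; the variable names are assumptions.

```javascript
const text = "Iraq fighting";

// Character class: [^u] must consume a character, so it swallows the space.
const withClass = text.match(/\b\w*q[^u]\w*\b/g);
console.log(withClass); // [ 'Iraq fighting' ]

// Negative lookahead: only checks the position after "q".
const withLookahead = text.match(/\b\w*q(?!u)\w*\b/g);
console.log(withLookahead); // [ 'Iraq' ]

// A word where "q" is followed by "u" is rejected.
const rejected = "quit".match(/\b\w*q(?!u)\w*\b/g);
console.log(rejected); // null
```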

The zero-width negative lookahead assertion (?!exp) asserts that the text after this position cannot match the expression 'exp'. For example, \d{3}(?!\d) matches three digits that are not followed by another digit, and \b((?!abc)\w)+\b matches a word that does not contain the consecutive string 'abc'.
Similarly, (?<!exp), the zero-width negative lookbehind assertion, asserts that the text before this position cannot match the expression 'exp': (?<![a-z])\d{7} matches seven digits that are not preceded by a lowercase letter.
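
The three negative assertions above, \d{3}(?!\d), \b((?!abc)\w)+\b and (?<![a-z])\d{7}, can be tried out as follows. This is a sketch on made-up sample strings, assuming ECMAScript 2018 lookbehind support for the last pattern.

```javascript
// Three digits not followed by another digit.
const threeDigits = "1234 567".match(/\d{3}(?!\d)/g);
console.log(threeDigits); // [ '234', '567' ]

// Words that do not contain the consecutive string "abc".
const noAbc = "xabcy hello abc".match(/\b((?!abc)\w)+\b/g);
console.log(noAbc); // [ 'hello' ]

// Seven digits not preceded by a lowercase letter.
const sevenDigits = "a1234567 b 7654321".match(/(?<![a-z])\d{7}/g);
console.log(sevenDigits); // [ '7654321' ]
```

Note that in "1234" the first pattern matches "234": the assertion only forbids a digit after the three matched digits, not before them.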

A more complex example: (?<=<(\w+)>).*(?=<\/\1>) matches the content inside a simple HTML tag without attributes. (?<=<(\w+)>) specifies the prefix: a word enclosed in angle brackets (such as <b>). It is followed by .* (any string) and then by the suffix (?=<\/\1>). Note that \/ uses the character escaping mentioned earlier, and \1 is a backreference to the first captured group, that is, whatever (\w+) matched. So if the prefix is actually <b>, the suffix is </b>. The whole expression matches the content between <b> and </b> (again, not including the prefix and suffix themselves).
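
A small sketch of the pattern (?<=<(\w+)>).*(?=<\/\1>) in JavaScript, assuming an engine with ECMAScript 2018 lookbehind support. The sample strings and variable names are made up for illustration.

```javascript
const tagContent = /(?<=<(\w+)>).*(?=<\/\1>)/;

// The lookbehind captures the tag name; \1 reuses it in the lookahead.
const inner = "<b>bold text</b>".match(tagContent);
console.log(inner[0]); // bold text
console.log(inner[1]); // b

// With a mismatched closing tag the backreference cannot be satisfied.
const mismatched = "<b>text</i>".match(tagContent);
console.log(mismatched); // null
```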

The above can be a bit of a headache, so here are some additional notes.

An assertion declares a fact that should be true. In a regular expression, matching continues only if the assertion holds.
The following constructs are used to find something before or after certain content (without including that content). Like \b, ^ and $, they specify a position that must satisfy a certain condition (that is, an assertion), which is why they are called zero-width assertions. They are best illustrated with examples:

(?=exp), also known as a zero-width positive lookahead assertion, asserts that the text after its own position can match the expression 'exp'. For example, \b\w+(?=ing\b) matches the front part of a word ending with 'ing' (excluding 'ing' itself); when searching 'I'm singing while you're dancing.', it matches 'sing' and 'danc'.
(?<=exp), also known as a zero-width positive lookbehind assertion, asserts that the text before its own position can match the expression 'exp'. For example, (?<=\bre)\w+\b matches the latter part of a word starting with 're' (excluding 're' itself); when searching 'reading a book', it matches 'ading'.

Suppose you want to add a comma every three digits in a very long number (counting from the right, of course). You can find the part that needs commas in front of and inside it with ((?<=\d)\d{3})+\b; searching 1234567890 with it gives 234567890. (The quantifier must be + rather than *: with * the group could repeat zero times, and the first match would be an empty string at the start of the number.)
The following example uses both kinds of assertion: (?<=\s)\d+(?=\s) matches numbers delimited by whitespace (once again, not including the whitespace itself).

Supplement two:

Recently I needed regular-expression search and replace to process the source code of HTML files, so I took the opportunity to study regular expressions systematically. I had used them before, but each time only well enough to get by. While studying I still ran into plenty of problems, especially with zero-width lookahead (and I have to complain here: the Internet is full of copied-and-pasted content, so when you hit a problem you end up reading the same thing over and over). So I am writing down my own understanding here for future reference.

What is a zero-width positive lookahead assertion? Let's look at the official definition on MSDN:

(?= sub-expression)

(Zero-width positive lookahead assertion.) Matching continues only if the sub-expression matches to the right of this position. For example, \w+(?=\d) matches a word followed by a digit, without matching the digit.

A classic example: given a word ending in ing, get the content before ing:

using System;
using System.Text.RegularExpressions;
var reg = new Regex(@"\w+(?=ing)");
var str = "muing";
Console.WriteLine(reg.Match(str).Value); // Returns mu

Examples like the one above are everywhere on the Internet, and by now you may have concluded that the assertion simply returns the content before the expression exp.

Now let's look at the code below:

using System;
using System.Text.RegularExpressions;
var reg = new Regex(@"a(?=b)c");
var str = "abc";
Console.WriteLine(reg.IsMatch(str)); // Returns false

Why does it return false?

Actually, the official MSDN definition already says why, just in rather formal terms. The key point to notice is the phrase "this position": it is a position, not a character. So, combining the official definition with the first example, let's work through the second one:

Because a is followed by b, the a is matched at this point (as the first example shows, only a is returned; the content matched by exp is not). The a(?=b) part of a(?=b)c is now settled, and what remains is to match c. Where in the string abc does the matching of c start? According to the official definition, the sub-expression is checked to the right of the current position without moving it, so c is tried at the position of b. But b does not match c, the remaining part of a(?=b)c, so abc does not match a(?=b)c.

Then how should the regular expression be written to match abc?

The answer is: a(?=b)bc

Of course, some people will say that the pattern abc matches directly, so why bother? Indeed there is no need; the point is only to show what a zero-width positive lookahead assertion really does. The other zero-width assertions work on the same principle.
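
The same behaviour can be reproduced in JavaScript, the language of the earlier examples. This is a minimal sketch comparing a(?=b)c with a(?=b)bc; the variable names are assumptions.

```javascript
const str = "abc";

// After "a" matches, the lookahead checks "b" but stays in front of it,
// so "c" is compared with "b" and the match fails.
const withoutB = /a(?=b)c/.test(str);
console.log(withoutB); // false

// Consuming "b" explicitly after the assertion lets "c" match.
const withB = /a(?=b)bc/.test(str);
console.log(withB); // true

const whole = str.match(/a(?=b)bc/)[0];
console.log(whole); // abc
```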

Supplement three:

(?=exp): zero-width positive lookahead assertion. It asserts that the text after this position can match the expression exp.

# Match "product" only when it is followed by the suffix _path
'product_path'.scan(/(product)(?=_path)/) # => [["product"]]

(?<=exp): zero-width positive lookbehind assertion. It asserts that the text before this position can match the expression exp.

# Match "wangfei" only when it is preceded by the prefix name:
'name:wangfei'.scan(/(?<=name:)(wangfei)/) # => [["wangfei"]]

(?!exp): zero-width negative lookahead assertion. It asserts that the text after this position cannot match the expression exp.

# Match "product" only when the suffix is not _path
'product_path'.scan(/(product)(?!_path)/) # => [] (no match)
# Match "product" only when the suffix is not _url
'product_path'.scan(/(product)(?!_url)/) # => [["product"]]

(?<!exp): zero-width negative lookbehind assertion. It asserts that the text before this position cannot match the expression exp.

# Match "angelica" only when the prefix is not name:
'name:angelica'.scan(/(?<!name:)(angelica)/) # => [] (no match)
# Match "angelica" only when the prefix is not nick_name:
'name:angelica'.scan(/(?<!nick_name:)(angelica)/) # => [["angelica"]]
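
For comparison with the JavaScript examples earlier in this article, roughly the same four checks can be written as below. This is a sketch, assuming an engine with ECMAScript 2018 lookbehind support; the results object and its property names are made up for illustration. Note that String.prototype.match with the g flag returns null, not an empty array, when nothing matches.

```javascript
const results = {
  // Followed by _path: matches.
  positiveLookahead: "product_path".match(/product(?=_path)/g), // [ 'product' ]
  // Preceded by name: matches.
  positiveLookbehind: "name:wangfei".match(/(?<=name:)wangfei/g), // [ 'wangfei' ]
  // Must not be followed by _path, but it is: no match.
  negativeLookahead: "product_path".match(/product(?!_path)/g), // null
  // Must not be preceded by name:, but it is: no match.
  negativeLookbehind: "name:angelica".match(/(?<!name:)angelica/g), // null
  // Must not be preceded by nick_name:, and it is not: matches.
  otherPrefix: "name:angelica".match(/(?<!nick_name:)angelica/g), // [ 'angelica' ]
};
console.log(results);
```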
