What are Regular Expressions • Project Structure • How this Parser Works • Supported Regex Features • How to Build • How to Use (Code Examples)
Regex Parsing with a Lexer, Parser & Matcher. Supporting: ".*+?()[]^-|" with an Abstract Syntax Tree, Alternation & Repetition Operators
This project implements a regular expression parser and matching engine in C++. It takes a regex pattern, parses it into an Abstract Syntax Tree (AST), and provides functionality to match the pattern against input strings.
Regular Expressions (regex or regexp) are sequences of characters that define a search pattern. They are used to match character combinations in strings.
Key Concepts:
- Literals: Ordinary characters match themselves (e.g.,
a
matches "a"). - Metacharacters: Characters with special meanings:
.
(Dot): Matches any single character (except newline).*
(Star): Matches the preceding element zero or more times.+
(Plus): Matches the preceding element one or more times.?
(Question Mark): Matches the preceding element zero or one time.|
(Pipe): Acts like a boolean OR. Matches the expression before or after the pipe.()
(Parentheses): Groups expressions and captures matched text.[]
(Brackets): Defines a character set. Matches any one character within the brackets.^
(Caret inside[]
): Negates the character set. Matches any character not in the brackets.-
(Dash inside[]
): Defines a range of characters (e.g.,[a-z]
).
^
(Caret outside[]
): Matches the beginning of the string/line. (Note: This parser might use^
only for negation within brackets based on theregex::TokenType
).
- Quantifiers: Specify how many times an element must appear (
*
,+
,?
). - Character Classes: Predefined sets like
\d
(digits),\w
(word characters),\s
(whitespace). (Note: These might not be implemented in this specific parser). - Anchors: Specify positions like beginning (
^
) or end ($
) of the string. (Note:$
might not be implemented).
CMakeLists.txt
: Build configuration file for CMake.main.cpp
: Example usage of the regex library.regex_parser.h
: Header file defining the classes and structures (Lexer
,Parser
,ASTNode
,Regex
,Token
,TokenType
).regex_parser.cpp
: Implementation of the regex parsing and matching logic.README.md
: This file.build/
: Directory containing build artifacts generated by CMake.
- Lexing: The input regex pattern string is processed by the
regex::Lexer
class. It scans the pattern and converts it into a sequence ofregex::Token
s. Each token represents a part of the regex, like a literal character, a metacharacter (*
,+
,?
,.
,|
), or grouping symbols ((
,)
,[
,]
). The types are defined in theregex::TokenType
enum. - Parsing: The
regex::Parser
class takes the stream of tokens from the lexer. It uses recursive descent parsing to build an Abstract Syntax Tree (AST) represented byregex::ASTNode
objects. The AST hierarchically represents the structure and meaning of the regular expression. - Matching: The
regex::Regex
class holds the root of the generated AST. Itsmatch
method (and potentiallyfindAll
) uses the AST to determine if a given input string conforms to the pattern defined by the regex. (The exact matching algorithm, likely involving state machines or recursive traversal of the AST, is implemented inregex_parser.cpp
).
Based on regex::TokenType
, this parser supports:
- Literal characters (
regex::TokenType::CHAR
) - Any character (
.
,regex::TokenType::DOT
) - Zero or more quantifier (
*
,regex::TokenType::STAR
) - One or more quantifier (
+
,regex::TokenType::PLUS
) - Zero or one quantifier (
?
,regex::TokenType::QMARK
) - Grouping (
()
,regex::TokenType::LPAREN
,regex::TokenType::RPAREN
) - Character classes (
[]
,regex::TokenType::LBRACKET
,regex::TokenType::RBRACKET
) - Negation within character classes (
^
,regex::TokenType::CARET
) - Ranges within character classes (
-
,regex::TokenType::DASH
) - Alternation (
|
,regex::TokenType::PIPE
)
This project uses CMake.
- Ensure CMake is installed.
- Create a build directory:
mkdir build cd build
- Run CMake to configure the project:
(On Windows with Visual Studio, you might specify a generator, e.g.,
cmake ..
cmake .. -G "Visual Studio 17 2022"
) - Build the project:
(Alternatively, open the generated solution file (e.g.,
cmake --build .
RegexParser.sln
inbuild/
) in Visual Studio and build from there)
The executable regex_parser.exe
(or regex_parser
on Linux/macOS) will be created in the build directory (possibly within a Debug
or Release
subdirectory).
Include the header and use the regex::Regex
class.
#include "regex_parser.h"
#include <iostream>
#include <vector>
#include <string>
int main() {
try {
// Example 1: Simple literal match
regex::Regex re1("hello");
std::cout << "Pattern: hello" << std::endl;
std::cout << "Text: hello world -> Match: " << (re1.match("hello world") ? "Yes" : "No") << std::endl;
std::cout << "Text: world -> Match: " << (re1.match("world") ? "Yes" : "No") << std::endl;
std::cout << std::endl;
// Example 2: Using quantifiers and dot
regex::Regex re2("a.*b+c?"); // 'a', followed by zero or more any chars, one or more 'b's, zero or one 'c'
std::cout << "Pattern: a.*b+c?" << std::endl;
std::cout << "Text: axbybbc -> Match: " << (re2.match("axbybbc") ? "Yes" : "No") << std::endl;
std::cout << "Text: ab -> Match: " << (re2.match("ab") ? "Yes" : "No") << std::endl;
std::cout << "Text: ac -> Match: " << (re2.match("ac") ? "Yes" : "No") << std::endl;
std::cout << "Text: abb -> Match: " << (re2.match("abb") ? "Yes" : "No") << std::endl;
std::cout << std::endl;
// Example 3: Alternation and grouping
regex::Regex re3("(cat|dog)s?");
std::cout << "Pattern: (cat|dog)s?" << std::endl;
std::cout << "Text: cats -> Match: " << (re3.match("cats") ? "Yes" : "No") << std::endl;
std::cout << "Text: dog -> Match: " << (re3.match("dog") ? "Yes" : "No") << std::endl;
std::cout << "Text: bird -> Match: " << (re3.match("bird") ? "Yes" : "No") << std::endl;
std::cout << std::endl;
// Example 4: Character classes
regex::Regex re4("[a-zA-Z][0-9]+"); // A letter followed by one or more digits
std::cout << "Pattern: [a-zA-Z][0-9]+" << std::endl;
std::cout << "Text: R2D2 -> Match: " << (re4.match("R2D2") ? "Yes" : "No") << std::endl; // Should match 'R2' part
std::cout << "Text: C3PO -> Match: " << (re4.match("C3PO") ? "Yes" : "No") << std::endl; // Should match 'C3' part
std::cout << "Text: 123 -> Match: " << (re4.match("123") ? "Yes" : "No") << std::endl;
std::cout << std::endl;
} catch (const std::exception& e) {
std::cerr << "Error: " << e.what() << std::endl;
return 1;
}
return 0;
}