Skip to content

Regex Parsing with a Lexer, Parser & Matcher. Supporting: ".*+?()[]^-|" with an Abstract Syntax Tree, Alternation & Repetition Operators

Notifications You must be signed in to change notification settings

WillKirkmanM/regex

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Regex

Regex Parsing with a Lexer, Parser & Matcher. Supporting: ".*+?()[]^-|" with an Abstract Syntax Tree, Alternation & Repetition Operators

This project implements a regular expression parser and matching engine in C++. It takes a regex pattern, parses it into an Abstract Syntax Tree (AST), and provides functionality to match the pattern against input strings.

What are Regular Expressions?

Regular Expressions (regex or regexp) are sequences of characters that define a search pattern. They are used to match character combinations in strings.

Key Concepts:

  • Literals: Ordinary characters match themselves (e.g., a matches "a").
  • Metacharacters: Characters with special meanings:
    • . (Dot): Matches any single character (except newline).
    • * (Star): Matches the preceding element zero or more times.
    • + (Plus): Matches the preceding element one or more times.
    • ? (Question Mark): Matches the preceding element zero or one time.
    • | (Pipe): Acts like a boolean OR. Matches the expression before or after the pipe.
    • () (Parentheses): Groups expressions and captures matched text.
    • [] (Brackets): Defines a character set. Matches any one character within the brackets.
      • ^ (Caret inside []): Negates the character set. Matches any character not in the brackets.
      • - (Dash inside []): Defines a range of characters (e.g., [a-z]).
    • ^ (Caret outside []): Matches the beginning of the string/line. (Note: This parser might use ^ only for negation within brackets based on the regex::TokenType).
  • Quantifiers: Specify how many times an element must appear (*, +, ?).
  • Character Classes: Predefined sets like \d (digits), \w (word characters), \s (whitespace). (Note: These might not be implemented in this specific parser).
  • Anchors: Specify positions like beginning (^) or end ($) of the string. (Note: $ might not be implemented).

Project Structure

  • CMakeLists.txt: Build configuration file for CMake.
  • main.cpp: Example usage of the regex library.
  • regex_parser.h: Header file defining the classes and structures (Lexer, Parser, ASTNode, Regex, Token, TokenType).
  • regex_parser.cpp: Implementation of the regex parsing and matching logic.
  • README.md: This file.
  • build/: Directory containing build artifacts generated by CMake.

How this Parser Works

  1. Lexing: The input regex pattern string is processed by the regex::Lexer class. It scans the pattern and converts it into a sequence of regex::Tokens. Each token represents a part of the regex, like a literal character, a metacharacter (*, +, ?, ., |), or grouping symbols ((, ), [, ]). The types are defined in the regex::TokenType enum.
  2. Parsing: The regex::Parser class takes the stream of tokens from the lexer. It uses recursive descent parsing to build an Abstract Syntax Tree (AST) represented by regex::ASTNode objects. The AST hierarchically represents the structure and meaning of the regular expression.
  3. Matching: The regex::Regex class holds the root of the generated AST. Its match method (and potentially findAll) uses the AST to determine if a given input string conforms to the pattern defined by the regex. (The exact matching algorithm, likely involving state machines or recursive traversal of the AST, is implemented in regex_parser.cpp).

Supported Regex Features

Based on regex::TokenType, this parser supports:

  • Literal characters (regex::TokenType::CHAR)
  • Any character (., regex::TokenType::DOT)
  • Zero or more quantifier (*, regex::TokenType::STAR)
  • One or more quantifier (+, regex::TokenType::PLUS)
  • Zero or one quantifier (?, regex::TokenType::QMARK)
  • Grouping ((), regex::TokenType::LPAREN, regex::TokenType::RPAREN)
  • Character classes ([], regex::TokenType::LBRACKET, regex::TokenType::RBRACKET)
  • Negation within character classes (^, regex::TokenType::CARET)
  • Ranges within character classes (-, regex::TokenType::DASH)
  • Alternation (|, regex::TokenType::PIPE)

How to Build

This project uses CMake.

  1. Ensure CMake is installed.
  2. Create a build directory:
    mkdir build
    cd build
  3. Run CMake to configure the project:
    cmake ..
    (On Windows with Visual Studio, you might specify a generator, e.g., cmake .. -G "Visual Studio 17 2022")
  4. Build the project:
    cmake --build .
    (Alternatively, open the generated solution file (e.g., RegexParser.sln in build/) in Visual Studio and build from there)

The executable regex_parser.exe (or regex_parser on Linux/macOS) will be created in the build directory (possibly within a Debug or Release subdirectory).

How to Use (Code Examples)

Include the header and use the regex::Regex class.

#include "regex_parser.h"
#include <iostream>
#include <vector>
#include <string>

int main() {
    try {
        // Example 1: Simple literal match
        regex::Regex re1("hello");
        std::cout << "Pattern: hello" << std::endl;
        std::cout << "Text: hello world -> Match: " << (re1.match("hello world") ? "Yes" : "No") << std::endl;
        std::cout << "Text: world -> Match: " << (re1.match("world") ? "Yes" : "No") << std::endl;
        std::cout << std::endl;

        // Example 2: Using quantifiers and dot
        regex::Regex re2("a.*b+c?"); // 'a', followed by zero or more any chars, one or more 'b's, zero or one 'c'
        std::cout << "Pattern: a.*b+c?" << std::endl;
        std::cout << "Text: axbybbc -> Match: " << (re2.match("axbybbc") ? "Yes" : "No") << std::endl;
        std::cout << "Text: ab -> Match: " << (re2.match("ab") ? "Yes" : "No") << std::endl;
        std::cout << "Text: ac -> Match: " << (re2.match("ac") ? "Yes" : "No") << std::endl;
        std::cout << "Text: abb -> Match: " << (re2.match("abb") ? "Yes" : "No") << std::endl;
        std::cout << std::endl;

        // Example 3: Alternation and grouping
        regex::Regex re3("(cat|dog)s?");
        std::cout << "Pattern: (cat|dog)s?" << std::endl;
        std::cout << "Text: cats -> Match: " << (re3.match("cats") ? "Yes" : "No") << std::endl;
        std::cout << "Text: dog -> Match: " << (re3.match("dog") ? "Yes" : "No") << std::endl;
        std::cout << "Text: bird -> Match: " << (re3.match("bird") ? "Yes" : "No") << std::endl;
        std::cout << std::endl;

        // Example 4: Character classes
        regex::Regex re4("[a-zA-Z][0-9]+"); // A letter followed by one or more digits
        std::cout << "Pattern: [a-zA-Z][0-9]+" << std::endl;
        std::cout << "Text: R2D2 -> Match: " << (re4.match("R2D2") ? "Yes" : "No") << std::endl; // Should match 'R2' part
        std::cout << "Text: C3PO -> Match: " << (re4.match("C3PO") ? "Yes" : "No") << std::endl; // Should match 'C3' part
        std::cout << "Text: 123 -> Match: " << (re4.match("123") ? "Yes" : "No") << std::endl;
        std::cout << std::endl;

    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }

    return 0;
}

About

Regex Parsing with a Lexer, Parser & Matcher. Supporting: ".*+?()[]^-|" with an Abstract Syntax Tree, Alternation & Repetition Operators

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published