copa is a C++26 header-only combinator-based parser library. It allows you to build complex recursive descent parsers
by composing simple rules and mapping them directly to C++ structures (AST objects).
- Key Concepts
- Basic Usage
- Available Matchers
- Provided Helpers
- Important Considerations
- Mapping to AST
- Integrating with Readers
- Copa Reader
- AST Tree Generator
- Rule: A basic building block that matches a part of the input. Rules can be simple (match a character) or complex (composed of other rules).
- Matcher: A specific implementation of a rule (e.g.,
match_char,match_string). - Production: A high-level grammar definition that combines rules with an AST object and a result convertor.
- Sink (Aggregator): Collects results from matchers and populates the AST object.
To use copa, you define a Production. A production is a struct or class that defines:
ast_object: The type that will hold the parsing results.rules(): A static method returning the composed rules.convertor(): A static method returning an aggregator for theast_object.
The following example parses a string of format: "CMD {command_name} ON {target} {options} "
#include <iostream>
#include <vector>
#include <fil/copa/copa.hh>
#include <fil/copa/matcher.hh>
#include <fil/copa/sink.hh>
#include <fil/meta/buffer_reader.hh>
// 1. Define your AST structure
struct MyCommand {
std::string command;
std::string target;
std::vector<std::string> options;
};
// 2. Define the Grammar Production
struct CommandGrammar {
using ast_object = MyCommand;
static constexpr auto rules() {
using namespace fil::copa;
// Rule: "CMD" <id:command> "ON" <id:target> [options...]
return match_string<fixed_string{"CMD"}>{}
+ match_identifier<member<&MyCommand::command>>{}
+ match_string<fixed_string{"ON"}>{}
+ match_identifier<member<&MyCommand::target>>{}
+ list_rule<match_identifier<member<&MyCommand::options>>>{};
}
static constexpr auto convertor() {
return fil::copa::sink::aggregator<ast_object>{};
}
};
int main() {
std::string input = "CMD start ON engine turbo fast";
fil::buffer_reader reader(std::move(input));
CommandGrammar grammar;
// 3. Parse the input
auto result = fil::copa::parse(grammar, std::move(reader));
// result.has_value() == true
// result->command == start
// result->target == engine
// result->options == { turbo, fast }
}copa provides several built-in matchers in the fil::copa namespace:
Greedy Matcher: Copa design is based on greedy matching. If a matcher can match multiple times, it will match until it can't anymore. For instance, this means that it is impossible to match a list of integers followed by an integer of a specific value. As the list would consume that integer.
The design of the parser must take that into account.
match_char<char C>: Matches a single characterC.match_string<fixed_string {S}>: Matches an exact stringS.match_space_like: Matches whitespace characters (space, tab, newline).match_identifier: Matches an alphanumeric sequence (identifier).match_number: Matches numeric values.
tuple_rule<Rules...>: Sequence of rules to be provided in order.
This rule can be added to an instance of a rule by using the+operator.or_rule<Rules...>: Rule tried in order, at least one must be matching.
This rule can be added to an instance of a rule by using the|operator.list_rule<Rule>: Matches zero or more occurrences ofRule.repeat<int N, Rule>: Repeat N times the providedRule.
Rules can be composed using overloaded operators, every rule must inherit from comsable_rule to ensure that
composability across every rule:
- Sequence (
+):RuleA + RuleBmatchesRuleAfollowed byRuleB. (usingtuple_rule) - Alternation (
|):RuleA | RuleBmatchesRuleAorRuleB(choice). (usingor_rule) - Optional (
~):~RuleAmatchesRuleAor not (choice). (usingor_rule)
Example:
// Matches either "TRUE" or "FALSE"
auto boolean_rule = match_string<fixed_string{"TRUE"}>{} | match_string<fixed_string{"FALSE"}>{};may_rule<Rule>: Matches zero or one of the providedRule.
This uses the or_rule behind the hood and thus falls under the performance consideration stated below. The main thing to note is: it is expensive to use the may_rule with a tuple_rule.
This rule can be added to an instance of a rule by using the ~ operator.
fil::copa::match_semicol: Matches;charfil::copa::match_comma: Matches,charfil::copa::match_if: Matchesifstringfil::copa::match_while: Matcheswhilestringfil::copa::match_space_like: Matches space like as defined in the C standard (std::isspace)
Some are helping common composition:
fil::copa::parenthesis_wrapped<...>: Matches the rule wrapped in parenthesis(rule)fil::copa::bracket_wrapped<...>: Matches the rule wrapped in curly brackets{rule}fil::copa::square_wrapped<...>: Matches the rule wrapped in square brackets[rule]fil::copa::angle_wrapped<...>: Matches the rule wrapped in angle brackets<rule>fil::copa::apostrophed_wrapped<...>: Matches the rule wrapped in quotes"rule"
Some utility function to be able to use those wrappers utilitie as value (and thus comnining with |, + or ~)
fil::copa::parenthesised(const Rule auto rule) -> fil::copa::parenthesis_wrapped<Rule>:fil::copa::apostrophed(const Rule auto rule) -> fil::copa::apostrophed_wrapped<Rule>:
Avoid for performance reason
While it is syntactically possible to use (sequences created with the + operator) as alternatives within an or_rule,
this pattern is not recommended due to performance implications.
When a tuple_rule is used as an alternative in an or_rule the parser must create a complete copy of the convertor
state before attempting to match that alternative. This is necessary to ensure proper rollback semantics: if the
tuple_rule fails during matching, the parser needs to restore the convertor to its previous state before trying the
next alternative.
Example to avoid:
// INEFFICIENT: tuple_rule inside or_rule requires convertor copies
auto rule = (match_string<"ABC">{} + match_identifier{})
| (match_string<"DEF">{} + match_identifier{});This pattern triggers expensive deep copies of the convertor whenever backtracking is needed. For better performance, consider restructuring.
This consideration is particularly important in performance-critical parsing scenarios or when dealing with large inputs.
Avoid it for behavioral reasons:
When using or_rule (the | operator) with a tuple_rule (the + operator) that contains multiple parsing rules with
side effects, a critical issue emerges: side effects from partially matched rules are not rolled back when the overall
alternative fails.
It is advised to do the following to avoid or_rule of tuple_rule:
- Flattening the alternatives: Combine non-overlapping prefixes into a single rule when possible
- Restructuring the grammar: Rearrange rules to avoid using parser rule instead
- Using lookahead patterns: Design your grammar to make early decisions before committing to longer sequences. Using a keyword to separate different possibilities is a good solution.
To map parsed values to your ast_object, use the member template:
member<&Class::field>: Assigns the matched value to the specified field.- If the field is a
std::vector,memberwill automaticallypush_backthe value. - If the field is a setter method
void set_field(Value), it will call that method.
The parse function takes a reader. The library provides fil::buffer_reader for in-memory strings and is designed
to work with fil::file_reader for disk-based parsing.
#include <fil/file/file_reader.hh>
// ...
fil::file_reader reader(std::filesystem::path("input.txt"));
auto result = fil::copa::parse(grammar, std::move(reader));A reader in Copa is an abstraction that provides sequential access to input data. Whether you're parsing from memory buffers, files, network streams, or custom data sources, implementing a reader allows Copa to work seamlessly with your data source.
A reader is responsible for:
- Sequentially accessing data - Providing bytes one at a time
- Cursor management - Tracking the current position in the input
- Backtracking support - Allowing the parser to revert to previous states
- Peek operations - Looking ahead without consuming data
Your custom reader class must implement the following interface:
| Method | Return Type | Description |
|---|---|---|
next_byte() |
std::optional<std::uint8_t> |
Returns and consumes the next byte |
previous_byte() |
std::optional<std::uint8_t> |
Returns and un-consumes the previous byte |
peek() |
std::optional<std::uint8_t> |
Returns the next byte without consuming it |
reader_cursor() |
std::size_t |
Returns the current cursor position |
When a rule fails (e.g., in an or_rule), the parser must backtrack. Readers implement this via a shallow_copy
specialization, which allows the parser to save the current state (cursor) without duplicating the underlying data. This
is done for performance reasons to avoid unnecessary copy of the data.
If a shallow copy is not implemented, a normal copy of the reader would be made (which could imply a deepcopy or a copy
that has the read cursor badly set).
Imagine parsing with an or_rule that tries multiple alternatives. For each alternative, the parser needs to save the
reader's state. If the alternative fails, it restores the reader to the saved state and tries the next alternative.
Without shallow copy: Every backtracking attempt would require a full copy of your data (extremely inefficient for large files).
With shallow copy: Only the cursor position is copied, while the actual data remains shared. This is safe because:
- The original reader stays alive during the entire parse operation
- All shallow copies are destroyed after parsing completes
- No modifications are made to the underlying data
Specialize the fil::shallow_copy template for your reader type:
// specialization of the shallow_copy
template<>
struct shallow_copy<YourReaderType> {
// Create a shallow copy with shared data and copied cursor
static constexpr auto copy(const YourReaderType& object) {
//... implement a copy that does not imply a full data copy
// easiest way is to declare shallow_copy as a friend of your reader
}
static constexpr void assign(YourReaderType& object, YourReaderType&& other) {
// Handle move assignment (usually just cursor update)
}
};The ast_tree_generator is a specialized convertor sink in Copa designed to build abstract syntax trees (AST) with
proper operator precedence handling. Unlike the simple aggregator convertor that directly assigns values to struct
members, ast_tree_generator constructs a hierarchical tree structure suitable for expression parsing, especially when
handling operators with different precedence levels.
- Expression parsers with operators of varying precedence (e.g., calculator with
+,-,*,/) - Binary operator trees where parentheses or precedence rules determine the tree structure
- Recursive grammar rules where multiple precedence levels are defined separately
- Simple data structures that don't require tree hierarchies
- Non-expression-based grammars (use
aggregatorinstead)
An ast_node is a binary tree node that represents an operation or value in an expression. It consists of:
template<std::invocable<std::string> auto CallbackOp>
struct ast_node {
using operand_type = std::invoke_result_t<decltype(CallbackOp), std::string>;
operand_type value; // The operator or value
std::variant<std::shared_ptr<ast_node>,
std::string, int, char> lhs; // Left-hand side child
std::variant<std::shared_ptr<ast_node>,
std::string, int, char> rhs; // Right-hand side child
struct operand : callback<CallbackOp> {}; // Callback for operators
struct leaf : callback<[](const std::string& value) {
return value;
}> {}; // Callback for operands/leaves
};value: Stores the operator or result at this node (transformed via theoperandcallback)lhs(left): The left child - can be another AST node, string, int, or charrhs(right): The right child - can be another AST node, string, int, or charoperandcallback: Transforms matched operator tokens into meaningful valuesleafcallback: Marks operand/literal values (the leaves of the tree)
The ast_tree_generator takes a precedence level parameter:
explicit constexpr ast_tree_generator(std::uint32_t precedence = 0)
: precedence_(precedence) {}- Higher precedence operators (e.g.,
*and/have precedence 2) bind tighter than lower ones - Lower precedence operators (e.g.,
+and-have precedence 1) bind looser - The precedence determines how the tree is restructured during parsing
Create a callback that transforms operator tokens into meaningful values:
enum class op : int {
INVALID,
PLUS,
MINUS,
MULTIPLY,
DIVIDE
};
using ast_node = fil::copa::ast_node<[](const std::string& token) -> op {
if (token == "+") return op::PLUS;
if (token == "-") return op::MINUS;
if (token == "*") return op::MULTIPLY;
if (token == "/") return op::DIVIDE;
return op::INVALID;
}>;Create separate grammar rules for each precedence level, each with its own ast_tree_generator:
// Lowest precedence: addition and subtraction
struct level_1_grammar {
using ast_object = ast_node;
static constexpr auto rules() {
return fil::copa::match_string<fil::fixed_string {"+"}, ast_node::operand> {}
| fil::copa::match_string<fil::fixed_string {"-"}, ast_node::operand> {};
}
static constexpr auto convertor() {
return fil::copa::sink::ast_tree_generator<ast_node> {1}; // precedence = 1
}
};
// Higher precedence: multiplication and division
struct level_2_grammar {
using ast_object = ast_node;
static constexpr auto rules() {
return fil::copa::match_string<fil::fixed_string {"*"}, ast_node::operand> {}
| fil::copa::match_string<fil::fixed_string {"/"}, ast_node::operand> {};
}
static constexpr auto convertor() {
return fil::copa::sink::ast_tree_generator<ast_node> {2}; // precedence = 2
}
};Define rules for operands (leaves of the tree):
struct base_grammar {
using ast_object = ast_node;
static constexpr auto rules() {
return fil::copa::match_number<ast_node::leaf> {}
| fil::copa::parenthesised(
fil::copa::match_parser<calculator_grammar, ast_node::leaf> {});
}
static constexpr auto convertor() {
return fil::copa::sink::ast_tree_generator<ast_node> {0}; // precedence = 0 (base level)
}
};Combine all precedence levels into a complete grammar:
struct calculator_grammar {
using ast_object = ast_node;
static constexpr auto rules() {
return fil::copa::list_rule<fil::copa::or_rule<
fil::copa::match_parser<base_grammar>, // Try base first (operands)
fil::copa::match_parser<level_1_grammar>, // Then level 1 operators
fil::copa::match_parser<level_2_grammar> // Then level 2 operators
>> {};
}
static constexpr auto convertor() {
return fil::copa::sink::ast_tree_generator<ast_node> {0};
}
};Here's a complete calculator example that parses and builds an AST for mathematical expressions:
#include <fil/copa/copa.hh>
#include <fil/copa/sink.hh>
#include <fil/meta/buffer_reader.hh>
enum class op : int {
INVALID, PLUS, MINUS, MULTIPLY, DIVIDE
};
using ast_node = fil::copa::ast_node<[](const std::string& token) -> op {
if (token == "+") return op::PLUS;
if (token == "-") return op::MINUS;
if (token == "*") return op::MULTIPLY;
if (token == "/") return op::DIVIDE;
return op::INVALID;
}>;
// Multiplication and Division (precedence 2)
struct level_2_grammar {
using ast_object = ast_node;
static constexpr auto rules() {
return fil::copa::match_string<fil::fixed_string {"*"}, ast_node::operand> {}
| fil::copa::match_string<fil::fixed_string {"/"}, ast_node::operand> {};
}
static constexpr auto convertor() {
return fil::copa::sink::ast_tree_generator<ast_node> {2};
}
};
// Addition and Subtraction (precedence 1)
struct level_1_grammar {
using ast_object = ast_node;
static constexpr auto rules() {
return fil::copa::match_string<fil::fixed_string {"+"}, ast_node::operand> {}
| fil::copa::match_string<fil::fixed_string {"-"}, ast_node::operand> {};
}
static constexpr auto convertor() {
return fil::copa::sink::ast_tree_generator<ast_node> {1};
}
};
// Operands and Base Grammar (precedence 0)
struct base_grammar {
using ast_object = ast_node;
//! defined below as recursion over calculator_grammar requires the class definition to be defined before the function definition
static constexpr auto rules();
static constexpr auto convertor() {
return fil::copa::sink::ast_tree_generator<ast_node> {0};
}
};
// Complete Grammar
struct calculator_grammar {
using ast_object = ast_node;
static constexpr auto rules() {
return fil::copa::list_rule<fil::copa::or_rule<
fil::copa::match_parser<base_grammar>,
fil::copa::match_parser<level_1_grammar>,
fil::copa::match_parser<level_2_grammar>
>> {};
}
static constexpr auto convertor() {
return fil::copa::sink::ast_tree_generator<ast_node> {0};
}
};
constexpr auto base_grammar::rules() {
return fil::copa::match_number<ast_node::leaf> {}
| fil::copa::parenthesised(
fil::copa::match_parser<calculator_grammar, ast_node::leaf> {});
}
int main() {
std::string input = "16 * (1337 + 42)";
fil::buffer_reader reader(std::move(input));
auto result = fil::copa::parse(calculator_grammar{}, std::move(reader));
if (result) {
// The resulting AST structure represents the expression
// with proper precedence handling:
// *
// / \
// 16 +
// / \
// 1337 42
std::cout << "Parse successful!" << std::endl;
}
}- Precedence Handling: Tree restructuring is O(depth) for each operator, typically very efficient
- Memory: Uses
std::shared_ptrfor node management; consider memory usage with very deep trees - Callback Overhead: The operator callback is invoked once per operator; keep it lightweight
- Variant Storage: The use of
std::variantforlhsandrhsprovides flexibility but has minor runtime overhead