
Recommendations for handling erroneous input #630

@Xeverous

Description


I have been using X3 for a while and I have trouble deciding where to handle invalid input, so I'm asking for recommendations; perhaps the library authors have more experience. I haven't found any guidelines on this in the documentation. I think problems like this one (and many general recommendations) would be a worthwhile addition to the documentation: the library is focused on a very narrow topic, and I suspect not all library users are very knowledgeable in this area.

The problem is as follows: the input contains invalid tokens, but they can be dealt with somehow. For example, if a C++ compiler sees int x = foo(); and there is no definition of foo, it can issue an error and still continue, because it knows the type of x, so further code is not a problem. At worst, on punctuation problems, it can skip to the next ; and assume the statement did not exist.

I'm having a similar issue: I have a lot of keyword-like elements in the grammar, but I'm not sure which approach to take:

A) Should I make the grammar very precise and fail the parse upon any slight error in the input? (That is, keep it as a parse error.)
B) Should I make the grammar more flexible by allowing all kinds of literals that my "language" supports, and then report errors in the code that traverses the complete AST? (That is, turn parse errors into semantic errors by making the grammar more flexible.)

For example, let's say that in some text-based language format the following lines are valid:

SetColor 255 0 255 255
SetColor 255 0 255
SetColor 0xff00ff
SetColor 0xff00ffff
SoundEffect true
SoundEffect false

A (precise grammar version)

auto const color_statement =
    "SetColor" >> ((int_ >> int_ >> int_ >> -int_) | lexeme["0x" >> hex_char_table]) >> eol;

B (loose grammar with more semantic analysis required)

auto const token = boolean_keywords_table | lexeme["0x" >> hex_char_table] | int_;
auto const color_statement = "SetColor" >> +token >> eol;

In the case of A, parsing will fail much more often on any input problem, and there will be many complex subgrammars, one for each "command".

In the case of B, parsing will often succeed even if the input contains something like SetColor 255 255 or SetColor true, but there will be very few complex grammars, as most commands will reuse the +token >> eol part; this makes the parser simpler but requires more code to work with the AST.

Is there any general recommendation for this type of problem, or is it so case-specific that I should use my own judgement? Handling semantic errors is much easier for me than syntactic ones, and with semantic errors it's much more likely that the code can move forward, because fixing/faking semantic state is far less complex than fixing the parser's position and adjacent AST nodes.

Last thing: if B is better, what about grammars like this?

auto const broken_token = *(char_ - eol);
auto const token = boolean_keywords_table | lexeme["0x" >> hex_char_table] | int_ | broken_token;

In this extreme case of flexibility, the program can accept a huge variety of invalid inputs, and the code that works with the complete AST will always report an error upon encountering a broken_token, but then we have the highest likelihood of being able to continue. I have no idea whether always-invalid constructs such as broken_token are a good design, though.
