
Recommendations for handling erroneous input #630

@Xeverous

Description


I have been using X3 for a while and I have trouble deciding where to handle invalid input, so I'm asking for recommendations; perhaps the library authors have more experience. I haven't found any guidelines on this in the documentation. I think problems like this one (and many general recommendations) would be a worthwhile addition to the documentation: the library is focused on a very narrow topic, and I suspect not all library users are very knowledgeable in this area.

The problem is as follows: the input contains invalid tokens, but they can be dealt with somehow. For example, if a C++ compiler sees int x = foo(); and there is no definition of foo, it can issue an error and still continue, because it knows the type of x, so further code is not a problem. At worst, on punctuation problems, it can skip to the next ; and assume the statement did not exist.

I'm having a similar issue: I have a lot of keyword-like elements in the grammar, but I'm not sure which approach to take:

A) Should I make the grammar very precise and fail the parse upon any slight error in the input? (That is, keep it as a parse error.)
B) Should I make the grammar more flexible by allowing all kinds of literals that my "language" supports, and then report errors in the code that traverses the complete AST? (That is, turn parse errors into semantic errors by making the grammar more flexible.)

For example, let's say that in some text-based language format the following lines are valid:

SetColor 255 0 255 255
SetColor 255 0 255
SetColor 0xff00ff
SetColor 0xff00ffff
SoundEffect true
SoundEffect false

A (precise grammar version)

auto const color_statement =
    "SetColor" >> ((int_ >> int_ >> int_ >> -int_) | lexeme["0x" >> hex_char_table]) >> eol;

B (loose grammar with more semantic analysis required)

auto const token = boolean_keywords_table | lexeme["0x" >> hex_char_table] | int_;
auto const color_statement = "SetColor" >> +token >> eol;

In the case of A, parsing will fail much more often on any input problem, and there will be many complex subgrammars, one for each "command".

In the case of B, parsing will often succeed even if the input contains something like SetColor 255 255 or SetColor true, but there will be very few complex grammars, as most commands will reuse the +token >> eol part; this makes the parser simpler but requires more code to work with the AST.

Is there any general recommendation for this type of problem, or is it so case-specific that I should use my own judgement? Handling semantic errors is much easier for me than syntactic ones, and with semantic errors it's much more likely that the code can move forward, because fixing/faking semantic state is far less complex than fixing the parser's position and adjacent AST nodes.

Last thing: if B is better, what about grammars like this?

auto const broken_token = *(char_ - eol);
auto const token = boolean_keywords_table | lexeme["0x" >> hex_char_table] | int_ | broken_token;

In this extreme case of flexibility, the program can accept a huge variety of invalid inputs, and the code that works with the complete AST will always report an error upon encountering a broken_token, but then we have the highest likelihood of being able to continue. I have no idea whether always-invalid constructs such as broken_token are a good design, though.
