-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add documentation example of retrieving position information of an arbitrary node #29
base: master
Are you sure you want to change the base?
Conversation
3 similar comments
def get_node_position(context, nodes, index): | ||
if index < 0: | ||
index = index + len(nodes) # Handle negative list indices. | ||
node_startpos = context.start_position + sum(len(nodes[n]) for n in range(index)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function assumes that all nodes
have len
defined which might not be true as the elements of nodes
can be arbitrary objects produced by lower level actions.
The only way I see to track positions of individual nodes is to take note of them in all actions. That is handled by parglare if tree building is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
context
has a start and end position, so it must know the length of everything. Unfortunately, it does not provide a start position for each token, it only does that for the production rule as a whole. (That is, what I really want is context.token_start_position[token_index]
.) As a work-around, I need to calculate the offset from the production rule start to the token, so I can use the start position of the rule, and apply parglare.pos_to_line_col
to get a human-readable terminal position.
I wrote this code, since in my opinion, line (and column) information is required in almost all non-trivial programs that use a parser. Parglare doesn't provide a standard function for it, so I think almost every Parglare user will have to dig through the API in the manual like I did, figuring out how to get useful position information. The example code shows how to do it in a common use case (ie parsing plain text).
While for you, parsing is the one and only job, for me, parsing is just the simple first step (takes a day or two if the language design is finished, and if it fits nicely in the parser assumptions).
Right behind it is static semantics checking and/or interpretation of the input, and error detection and reporting about the input is a dominant part there (takes several months at least rather than 2 days). In my experience, you always need line/column position from the input, with ascii-ish text being the dominant stuff being parsed, and the need to be able to fallback to dumping error messages onto stderr
. (That is, incorrect type at line 32, column 54
is so much nicer than incorrect type in the type definition starting at offset 33536
.)
I agree you need length of a token in my code, but I don't know how else to get a (line, column)
position of an arbitrary terminal in a production rule. As a secondary consideration, in case there is no length of a token, my guess is that a (line, column)
notion won't make sense either, ie, you don't need this code in that case.
I didn't look at building a parse tree, since my actions immediately abstract to an AST. I don't want a tree with all irrelevant stuff still in there, since I will have to crawl over it with manually written code rather than with your automagically generated callbacks. (That is, a parse tree is more work.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, probably I wasn't clear enough. It's not that I think that position is not important. It is very much important to me as well as the parsing is not the only thing I do. I do stuff after parsing :)
The note I made was that the code above assumes that all objects returned from the lower level actions in the bottom-up processing have length defined which might not be true. So, this example should make that clear.
parglare idea is to be general enough and flexible to be used in different context. So I try to provide bare minimum functionality and build on top of it. For example, support for assignments and auto-building of ASTs is provided using built-in obj
action which uses the same action calling mechanism as any other action.
Keeping positions of all tokens might be a significant waste of space/time if the information is not needed. If it is needed than a simple action decorator that is applied to all actions and keeps track of all tokens shouldn't be more than a few lines of code in Python. I've been doing recently some support for action decorators to make action registration easier.
It sounds like a good idea to provide decorator for position tracking in parglare, and that would be a more general approach which won't have assumption about results of the previous actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I typically need the position of one token in a production rule (which gets copied into the AST). I can even point out which token(s) exactly in advance. In another parser we use the @
to denote a token should be an argument in the callback, so we can retrieve its position, eg
@Override // CompoundProgram : @PROCKW @IDENTIFIERTK PAROPENTK PARCLOSETK COLONTK Body @ENDKW;
public List<Declaration> parseCompoundProgram01(Token t1, Token t2, ParserBody p6, Token t7) {
It's the wrong programming language, but you get the idea :p
This parser is much more aimed at parsing text, so 'Token.position' is a line and column already, which avoids a lot of boilerplate position conversions.
Another option could be to drop all the position conversion here, and just get and store the offset of a token in the AST. Then, conversion becomes a separate problem outside parsing, although I can see a user wanting to convert the offset into position while constructing the AST rather than at error reporting time. You'd likely want to provide a simple additional API for the 'text input' case though as a service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. You can easily do that now in parglare. Default actions in parglare will convert token to string it matches and non-terminal rule to a Python list. If you use assignments, obj
action will be executed on those rules by default which creates object with attributes created by the LHS of assignments and values matched by RHS. The class of the object will be named after the grammar rule. Thus, you have automatic AST building out-of-the box. Position is kept in all AST created objects as _pg_start_position
and _pg_end_position
(that is done by obj
action). The only thing you might need in addition is the position of tokens/terminal matches. That can be easily provided by overriding default shift action that converts token to a string.
For example:
@token some_token: /.../;
@token some_other_token: /.../;
where token
is action you provide that will take what is matched and create object with matched content and position for example. Notice that in this way you can choose which token keeps its position.
The other way to achieve similar is by converting each terminal rule to a non-terminal rule using assignments:
some_token: value=/.../;
some_other_token: value=/.../;
Now, these rules are actually non-terminal rules using assignments. They will be processed by obj
action and they will get _pg_start/end_position
attributes. Matched value will be in the value
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great that it's already possible. The main problem might be lack of a "how to get positions in my ast" guide?
I typically don't use automagic AST generators, since the structure of the parsed text normally does not match the structure of my AST. As a simple example, consider
vardef: VAR type variables SEMICOL ;
type: NAME ;
variables: NAME | variables COMMA NAME ;
NAME: /[A-Z][A-Za-z0-9]+/;
This declares variables in C style, ie int i, j, k;
In my AST, I'd like
t = Type('int', intPos)
AST = [VarDecl(t, 'i', iPos),
VarDecl(t, "j", jPos),
VarDecl(t, "k", kPos)]
This makes further processing of variable declarations easier. Automagic AST generation can't handle such structural changes, for the simple reason that it doesn't know what I want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great that it's already possible. The main problem might be lack of a "how to get positions in my ast" guide?
Yes, there is a lack of this in the docs. I'll see to provide some form of action decorator for this and document it.
I typically don't use automagic AST generators, since the structure of the parsed text normally does not match the structure of my AST. As a simple example, consider...
You got the point. Indeed, there is a need to manually handle such cases. That's why I didn't go with textX route like before where all you have is just auto-AST building. In parglare you have some sane defaults (at least sane in my view :) ) but you can override what you don't like. Here you could override default obj
action for vardef
and provide your own as there is the change in the structure. But for the part where the syntactic structure fits wanted AST you don't need to do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's less work, but I prefer to have a proper class for each AST node, since it has room for documentation, and it serves as a stable, independent, easy to find reference, while working on semantics checking. That's likely a personal preference though, ymmv.
Hopefully reducing time spend looking for how to do X :)
Maybe move this function into the regular parglare code?