feat(parsers): extract access modifiers and decorators via highlights.scm#566
feat(parsers): extract access modifiers and decorators via highlights.scm#566ChetanyaRathi wants to merge 5 commits into
Conversation
…ghlights - Append custom fallback decorator highlights for languages missing them in upstream tree-sitter packages. - Expand modifier extraction to check wrapper nodes (like decorated_definition). - Remove obsolete decorator extraction tests.
…guage extraction tests
…lers and update ingest tests
There was a problem hiding this comment.
Code Review
This pull request refactors decorator and modifier extraction across multiple languages by replacing language-specific handler methods with a unified utility that leverages tree-sitter highlights queries. It also updates the schema to store modifiers for classes, functions, and methods. The review feedback highlights three key issues: a bug in query loading where a module import failure incorrectly skips fallback queries, noisy modifier extraction that captures definition keywords like def or class, and a parsing failure in _decorator_tail_names when handling decorators with arguments.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try: | ||
| module_name = f"{cs.TREE_SITTER_MODULE_PREFIX}{lang_name.replace('-', '_')}" | ||
| module = importlib.import_module(module_name) | ||
|
|
||
| query_str = "" | ||
| if hasattr(module, "HIGHLIGHTS_QUERY"): | ||
| query_str = module.HIGHLIGHTS_QUERY | ||
|
|
||
| fallback_path = ( | ||
| Path(__file__).parent / "queries" / "highlights" / f"{lang_name}.scm" | ||
| ) | ||
| if fallback_path.exists(): | ||
| custom_queries = fallback_path.read_text(encoding="utf-8") | ||
| query_str = ( | ||
| query_str + "\n" + custom_queries if query_str else custom_queries | ||
| ) | ||
|
|
||
| if query_str: | ||
| return Query(language, query_str) | ||
| except Exception as e: | ||
| logger.debug(f"Failed to load highlights query for {lang_name}: {e}") | ||
| return None |
There was a problem hiding this comment.
If the tree-sitter module import fails (e.g., ModuleNotFoundError or ImportError), the entire try block is aborted, which completely skips loading the fallback highlights query from fallback_path. This is a major bug because fallback queries should still be loaded even if the module import fails (or if the module is loaded from a submodule where the bindings path is no longer in sys.path). Separating the module import/query extraction from the fallback path loading using separate try-except blocks is much more robust and correct.
query_str = ""
try:
module_name = f"{cs.TREE_SITTER_MODULE_PREFIX}{lang_name.replace('-', '_')}"
module = importlib.import_module(module_name)
if hasattr(module, "HIGHLIGHTS_QUERY"):
query_str = module.HIGHLIGHTS_QUERY
except Exception as e:
logger.debug(f"Failed to import tree-sitter module for {lang_name}: {e}")
try:
fallback_path = (
Path(__file__).parent / "queries" / "highlights" / f"{lang_name}.scm"
)
if fallback_path.exists():
custom_queries = fallback_path.read_text(encoding="utf-8")
query_str = (
query_str + "\n" + custom_queries if query_str else custom_queries
)
if query_str:
return Query(language, query_str)
except Exception as e:
logger.debug(f"Failed to load highlights query for {lang_name}: {e}")
return None| def extract_modifiers_and_decorators( | ||
| node: ASTNode, lang_queries: LanguageQueries | ||
| ) -> tuple[list[str], list[str]]: | ||
| query = lang_queries.get("highlights") | ||
| if not query: | ||
| return [], [] | ||
|
|
||
| cursor = get_query_cursor(query) | ||
|
|
||
| body_node = node.child_by_field_name("body") | ||
| header_end_byte = body_node.start_byte if body_node else node.end_byte | ||
|
|
||
| target_node = node | ||
| if node.parent and node.parent.type in ("decorated_definition", "export_statement"): | ||
| target_node = node.parent | ||
|
|
||
| cursor.set_byte_range(target_node.start_byte, header_end_byte) | ||
|
|
||
| captures = sorted_captures(cursor, target_node) | ||
|
|
||
| modifiers: list[str] = [] | ||
| decorators: list[str] = [] | ||
|
|
||
| for name, nodes in captures.items(): | ||
| if name.startswith("keyword.modifier") or name == "keyword": | ||
| for n in nodes: | ||
| text = safe_decode_text(n) | ||
| if text and text not in modifiers: | ||
| modifiers.append(text) | ||
| elif name.startswith("attribute") or name.startswith("function.decorator"): | ||
| for n in nodes: | ||
| text = safe_decode_text(n) | ||
| if text and text not in decorators: | ||
| decorators.append(text) | ||
|
|
||
| return modifiers, decorators |
There was a problem hiding this comment.
The code extracts any keyword captured as @keyword or @keyword.modifier as a modifier. However, standard highlights queries capture definition keywords (like def, class, fn, struct, impl, interface, enum, function, trait, type) and literal keywords (like None, True, False, null, true, false, void) as @keyword. This results in noisy and incorrect modifiers (e.g., modifiers=["def"] for Python functions, modifiers=["class"] for Python classes, modifiers=["fn"] for Rust functions, etc.). We should define a set of excluded keywords and filter them out when extracting modifiers.
_EXCLUDED_KEYWORDS = frozenset({
"def", "class", "fn", "struct", "impl", "interface", "enum", "function", "trait", "type", "void",
"None", "True", "False", "null", "true", "false", "return", "import", "from", "as", "where"
})
def extract_modifiers_and_decorators(
node: ASTNode, lang_queries: LanguageQueries
) -> tuple[list[str], list[str]]:
query = lang_queries.get("highlights")
if not query:
return [], []
cursor = get_query_cursor(query)
body_node = node.child_by_field_name("body")
header_end_byte = body_node.start_byte if body_node else node.end_byte
target_node = node
if node.parent and node.parent.type in ("decorated_definition", "export_statement"):
target_node = node.parent
cursor.set_byte_range(target_node.start_byte, header_end_byte)
captures = sorted_captures(cursor, target_node)
modifiers: list[str] = []
decorators: list[str] = []
for name, nodes in captures.items():
if name.startswith("keyword.modifier") or name == "keyword":
for n in nodes:
text = safe_decode_text(n)
if text and text not in modifiers and text not in _EXCLUDED_KEYWORDS:
modifiers.append(text)
elif name.startswith("attribute") or name.startswith("function.decorator"):
for n in nodes:
text = safe_decode_text(n)
if text and text not in decorators:
decorators.append(text)
return modifiers, decorators| def _decorator_tail_names(decorators: list[str]) -> set[str]: | ||
| return { | ||
| decorator.lstrip(cs.DECORATOR_AT).split(cs.SEPARATOR_DOT)[-1] | ||
| decorator.lstrip("@#[]() ").split(cs.SEPARATOR_DOT)[-1].rstrip(")]") | ||
| for decorator in decorators | ||
| } |
There was a problem hiding this comment.
The code decorator.lstrip("@#[]() ").split(cs.SEPARATOR_DOT)[-1].rstrip(")]") fails to correctly extract the bare name of decorators with arguments (e.g., @cached_property(ttl=3600) or @Component({selector: 'app-root'})). For @cached_property(ttl=3600), it returns cached_property(ttl=3600, which fails to match cached_property in cs.PROPERTY_DECORATORS. We should split on ( first to strip any arguments before extracting the tail name.
| def _decorator_tail_names(decorators: list[str]) -> set[str]: | |
| return { | |
| decorator.lstrip(cs.DECORATOR_AT).split(cs.SEPARATOR_DOT)[-1] | |
| decorator.lstrip("@#[]() ").split(cs.SEPARATOR_DOT)[-1].rstrip(")]") | |
| for decorator in decorators | |
| } | |
| def _decorator_tail_names(decorators: list[str]) -> set[str]: | |
| return { | |
| decorator.lstrip("@#[]() ").split("(")[0].split(cs.SEPARATOR_DOT)[-1].rstrip(")] ") | |
| for decorator in decorators | |
| } |
References
- When parsing decorators, annotations, or attributes, extract the full text including arguments, not just the name. This preserves crucial semantic information (e.g., arguments in
@RequestMapping(value="/api")or#[derive(Debug)]) for RAG queries and ensures consistency across all supported languages (Python, Java, Rust, TypeScript).
Greptile SummaryThis PR moves modifier and decorator extraction into a shared highlights-query path. The main changes are:
Confidence Score: 3/5The shared extraction refactor has targeted regressions that can silently drop decorator metadata for supported languages. The implementation is broadly covered by language tests, but the highlighted code paths leave Rust sibling attributes and vendored-grammar fallback highlights unhandled. codebase_rag/parsers/utils.py and codebase_rag/parser_loader.py
What T-Rex did
|
| target_node = node | ||
| if node.parent and node.parent.type in ("decorated_definition", "export_statement"): | ||
| target_node = node.parent | ||
|
|
||
| cursor.set_byte_range(target_node.start_byte, header_end_byte) |
There was a problem hiding this comment.
Rust attributes are skipped
For Rust, outer attributes like #[test] and #[derive(Debug)] are sibling attribute_item nodes before the function or class node, not children of it. This range starts at node.start_byte and queries only target_node, so the shared extractor never sees those siblings. Rust methods and classes now get empty decorators where the removed handler walked prev_named_sibling.
Artifacts
Repro: focused Rust attribute ingestion script
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: failing parser ingestion output showing empty method decorators
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: codebase_rag/parsers/utils.py
Line: 102-106
Comment:
**Rust attributes are skipped**
For Rust, outer attributes like `#[test]` and `#[derive(Debug)]` are sibling `attribute_item` nodes before the function or class node, not children of it. This range starts at `node.start_byte` and queries only `target_node`, so the shared extractor never sees those siblings. Rust methods and classes now get empty `decorators` where the removed handler walked `prev_named_sibling`.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| module_name = f"{cs.TREE_SITTER_MODULE_PREFIX}{lang_name.replace('-', '_')}" | ||
| module = importlib.import_module(module_name) | ||
|
|
||
| query_str = "" | ||
| if hasattr(module, "HIGHLIGHTS_QUERY"): | ||
| query_str = module.HIGHLIGHTS_QUERY | ||
|
|
||
| fallback_path = ( | ||
| Path(__file__).parent / "queries" / "highlights" / f"{lang_name}.scm" | ||
| ) | ||
| if fallback_path.exists(): | ||
| custom_queries = fallback_path.read_text(encoding="utf-8") | ||
| query_str = ( | ||
| query_str + "\n" + custom_queries if query_str else custom_queries | ||
| ) | ||
|
|
||
| if query_str: | ||
| return Query(language, query_str) | ||
| except Exception as e: |
There was a problem hiding this comment.
Fallback query is unreachable
The local queries/highlights/*.scm fallback is inside the same try block after importlib.import_module(module_name). When a grammar is loaded from the vendored submodule path instead of an installed tree_sitter_* package, that import fails after the path is removed, so the checked-in highlights query is never read and highlights becomes None. Modifier and decorator extraction silently disables for that language.
Artifacts
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: runtime output showing import failure prevents fallback highlights query loading
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: codebase_rag/parser_loader.py
Line: 237-255
Comment:
**Fallback query is unreachable**
The local `queries/highlights/*.scm` fallback is inside the same `try` block after `importlib.import_module(module_name)`. When a grammar is loaded from the vendored submodule path instead of an installed `tree_sitter_*` package, that import fails after the path is removed, so the checked-in highlights query is never read and `highlights` becomes `None`. Modifier and decorator extraction silently disables for that language.
How can I resolve this? If you propose a fix, please make it concise.…nition keywords, strip decorator args, capture Rust sibling attributes
|
Heads up: CI didn't run here — all jobs show "The job was not started because your account is locked due to a billing issue," which looks like a repo-level Actions billing problem rather than anything in this PR. Locally the full unit suite passes (only Windows-specific path/symlink/encoding tests fail, which pass on Linux CI), ruff check + format are clean, and ty check codebase_rag is clean. Happy to re-run once Actions is available. |
Closes #525. Advances #521.
Generalizes access-modifier and decorator extraction from Java-only to a single
shared, highlights.scm-driven path for all languages.
parser_loader.py; populates modifiers: list[str] and decorators: list[str] on
Function/Method/Class nodes (empty list when absent).
shared path, removing the bespoke logic; kept the handler Protocol consistent.
Testing: full unit suite green on CI targets; ruff check + ruff format clean;
ty check codebase_rag (--exclude tests) clean. Local Windows shows only
OS-specific failures (path separators, symlink privileges, cp1252 encoding,
libclang) that pass on Linux CI.