Skip to content

feat(parsers): extract access modifiers and decorators via highlights.scm#566

Open
ChetanyaRathi wants to merge 5 commits into
vitali87:mainfrom
ChetanyaRathi:feat/525-highlights-modifiers-decorators
Open

feat(parsers): extract access modifiers and decorators via highlights.scm#566
ChetanyaRathi wants to merge 5 commits into
vitali87:mainfrom
ChetanyaRathi:feat/525-highlights-modifiers-decorators

Conversation

@ChetanyaRathi

Copy link
Copy Markdown
Contributor

Closes #525. Advances #521.

Generalizes access-modifier and decorator extraction from Java-only to a single
shared, highlights.scm-driven path for all languages.

  • New shared extract_modifiers_and_decorators in parsers/utils.py, loaded via
    parser_loader.py; populates modifiers: list[str] and decorators: list[str] on
    Function/Method/Class nodes (empty list when absent).
  • Refactored the per-language handlers (java, js_ts, php, rust, python) onto the
    shared path, removing the bespoke logic; kept the handler Protocol consistent.
  • Added per-language extraction tests.

Testing: full unit suite green on CI targets; ruff check + ruff format clean;
ty check codebase_rag (--exclude tests) clean. Local Windows shows only
OS-specific failures (path separators, symlink privileges, cp1252 encoding,
libclang) that pass on Linux CI.

Bot added 4 commits July 1, 2026 15:39
…ghlights

- Append custom fallback decorator highlights for languages missing them in upstream tree-sitter packages.

- Expand modifier extraction to check wrapper nodes (like decorated_definition).

- Remove obsolete decorator extraction tests.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors decorator and modifier extraction across multiple languages by replacing language-specific handler methods with a unified utility that leverages tree-sitter highlights queries. It also updates the schema to store modifiers for classes, functions, and methods. The review feedback highlights three key issues: a bug in query loading where a module import failure incorrectly skips fallback queries, noisy modifier extraction that captures definition keywords like def or class, and a parsing failure in _decorator_tail_names when handling decorators with arguments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +236 to +257
try:
module_name = f"{cs.TREE_SITTER_MODULE_PREFIX}{lang_name.replace('-', '_')}"
module = importlib.import_module(module_name)

query_str = ""
if hasattr(module, "HIGHLIGHTS_QUERY"):
query_str = module.HIGHLIGHTS_QUERY

fallback_path = (
Path(__file__).parent / "queries" / "highlights" / f"{lang_name}.scm"
)
if fallback_path.exists():
custom_queries = fallback_path.read_text(encoding="utf-8")
query_str = (
query_str + "\n" + custom_queries if query_str else custom_queries
)

if query_str:
return Query(language, query_str)
except Exception as e:
logger.debug(f"Failed to load highlights query for {lang_name}: {e}")
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If the tree-sitter module import fails (e.g., ModuleNotFoundError or ImportError), the entire try block is aborted, which completely skips loading the fallback highlights query from fallback_path. This is a major bug because fallback queries should still be loaded even if the module import fails (or if the module is loaded from a submodule where the bindings path is no longer in sys.path). Separating the module import/query extraction from the fallback path loading using separate try-except blocks is much more robust and correct.

    query_str = ""
    try:
        module_name = f"{cs.TREE_SITTER_MODULE_PREFIX}{lang_name.replace('-', '_')}"
        module = importlib.import_module(module_name)
        if hasattr(module, "HIGHLIGHTS_QUERY"):
            query_str = module.HIGHLIGHTS_QUERY
    except Exception as e:
        logger.debug(f"Failed to import tree-sitter module for {lang_name}: {e}")

    try:
        fallback_path = (
            Path(__file__).parent / "queries" / "highlights" / f"{lang_name}.scm"
        )
        if fallback_path.exists():
            custom_queries = fallback_path.read_text(encoding="utf-8")
            query_str = (
                query_str + "\n" + custom_queries if query_str else custom_queries
            )

        if query_str:
            return Query(language, query_str)
    except Exception as e:
        logger.debug(f"Failed to load highlights query for {lang_name}: {e}")
    return None

Comment on lines +90 to +125
def extract_modifiers_and_decorators(
node: ASTNode, lang_queries: LanguageQueries
) -> tuple[list[str], list[str]]:
query = lang_queries.get("highlights")
if not query:
return [], []

cursor = get_query_cursor(query)

body_node = node.child_by_field_name("body")
header_end_byte = body_node.start_byte if body_node else node.end_byte

target_node = node
if node.parent and node.parent.type in ("decorated_definition", "export_statement"):
target_node = node.parent

cursor.set_byte_range(target_node.start_byte, header_end_byte)

captures = sorted_captures(cursor, target_node)

modifiers: list[str] = []
decorators: list[str] = []

for name, nodes in captures.items():
if name.startswith("keyword.modifier") or name == "keyword":
for n in nodes:
text = safe_decode_text(n)
if text and text not in modifiers:
modifiers.append(text)
elif name.startswith("attribute") or name.startswith("function.decorator"):
for n in nodes:
text = safe_decode_text(n)
if text and text not in decorators:
decorators.append(text)

return modifiers, decorators

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The code extracts any keyword captured as @keyword or @keyword.modifier as a modifier. However, standard highlights queries capture definition keywords (like def, class, fn, struct, impl, interface, enum, function, trait, type) and literal keywords (like None, True, False, null, true, false, void) as @keyword. This results in noisy and incorrect modifiers (e.g., modifiers=["def"] for Python functions, modifiers=["class"] for Python classes, modifiers=["fn"] for Rust functions, etc.). We should define a set of excluded keywords and filter them out when extracting modifiers.

_EXCLUDED_KEYWORDS = frozenset({
    "def", "class", "fn", "struct", "impl", "interface", "enum", "function", "trait", "type", "void",
    "None", "True", "False", "null", "true", "false", "return", "import", "from", "as", "where"
})


def extract_modifiers_and_decorators(
    node: ASTNode, lang_queries: LanguageQueries
) -> tuple[list[str], list[str]]:
    query = lang_queries.get("highlights")
    if not query:
        return [], []

    cursor = get_query_cursor(query)

    body_node = node.child_by_field_name("body")
    header_end_byte = body_node.start_byte if body_node else node.end_byte

    target_node = node
    if node.parent and node.parent.type in ("decorated_definition", "export_statement"):
        target_node = node.parent

    cursor.set_byte_range(target_node.start_byte, header_end_byte)

    captures = sorted_captures(cursor, target_node)

    modifiers: list[str] = []
    decorators: list[str] = []

    for name, nodes in captures.items():
        if name.startswith("keyword.modifier") or name == "keyword":
            for n in nodes:
                text = safe_decode_text(n)
                if text and text not in modifiers and text not in _EXCLUDED_KEYWORDS:
                    modifiers.append(text)
        elif name.startswith("attribute") or name.startswith("function.decorator"):
            for n in nodes:
                text = safe_decode_text(n)
                if text and text not in decorators:
                    decorators.append(text)

    return modifiers, decorators

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit 8191a09

Comment on lines 155 to 159
def _decorator_tail_names(decorators: list[str]) -> set[str]:
return {
decorator.lstrip(cs.DECORATOR_AT).split(cs.SEPARATOR_DOT)[-1]
decorator.lstrip("@#[]() ").split(cs.SEPARATOR_DOT)[-1].rstrip(")]")
for decorator in decorators
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The code decorator.lstrip("@#[]() ").split(cs.SEPARATOR_DOT)[-1].rstrip(")]") fails to correctly extract the bare name of decorators with arguments (e.g., @cached_property(ttl=3600) or @Component({selector: 'app-root'})). For @cached_property(ttl=3600), it returns cached_property(ttl=3600, which fails to match cached_property in cs.PROPERTY_DECORATORS. We should split on ( first to strip any arguments before extracting the tail name.

Suggested change
def _decorator_tail_names(decorators: list[str]) -> set[str]:
return {
decorator.lstrip(cs.DECORATOR_AT).split(cs.SEPARATOR_DOT)[-1]
decorator.lstrip("@#[]() ").split(cs.SEPARATOR_DOT)[-1].rstrip(")]")
for decorator in decorators
}
def _decorator_tail_names(decorators: list[str]) -> set[str]:
return {
decorator.lstrip("@#[]() ").split("(")[0].split(cs.SEPARATOR_DOT)[-1].rstrip(")] ")
for decorator in decorators
}
References
  1. When parsing decorators, annotations, or attributes, extract the full text including arguments, not just the name. This preserves crucial semantic information (e.g., arguments in @RequestMapping(value="/api") or #[derive(Debug)]) for RAG queries and ensures consistency across all supported languages (Python, Java, Rust, TypeScript).

@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR moves modifier and decorator extraction into a shared highlights-query path. The main changes are:

  • Adds highlights queries to parser loading and language query metadata.
  • Populates modifiers and decorators on function, method, and class nodes.
  • Removes per-language handler decorator extraction.
  • Adds highlight query files for Java, JavaScript, PHP, Python, Rust, and TypeScript.
  • Updates tests and semantic search metadata for the new fields.

Confidence Score: 3/5

The shared extraction refactor has targeted regressions that can silently drop decorator metadata for supported languages.

The implementation is broadly covered by language tests, but the highlighted code paths leave Rust sibling attributes and vendored-grammar fallback highlights unhandled.

codebase_rag/parsers/utils.py and codebase_rag/parser_loader.py

T-Rex T-Rex Logs

What T-Rex did

  • Reproduced the Rust attribute ingestion issue by running a focused ingestion script against a generated Rust crate; the run loaded the real Rust tree-sitter parser and ingested lib.rs, showing that Foo and function_attr were preserved while method_attr had an empty decorators array, and the repro assertion failed.
  • Reproduced that the fallback highlights query is unreachable when tree_sitter_python import fails, even though the local fallback exists; a focused Python repro attempted to monkeypatch the parser loading and the process exited with code 1 after demonstrating the failure.
  • Compared decorator extraction across pre- and post-runs, finding that tsFunc and tsMethod have empty decorators while Java/PHP nodes contain nested duplicate decorator entries in the results.
  • Compared modifier extraction results for Java/PHP and PHP, observing after-run modifiers_present=true for relevant nodes, and noting explicit modifiers for plain classes and PHP noModFunc, which at times appear in ways that differ from the empty-list expectation.

View all artifacts

T-Rex Ran code and verified through T-Rex

Comments Outside Diff (3)

  1. General comment

    P2 TypeScript function and method decorators are not extracted

    • Bug
      • The head run creates TypeScript Function node tsFunc and Method node tsMethod, but both have decorators: [] despite source decorators @dec and @dec2(...). The decorated TypeScript class is populated, so this is a partial node-kind contract failure for TypeScript Function/Method nodes.
    • Cause
      • The shared extract_modifiers_and_decorators path only widens the extraction target for parents of type decorated_definition or export_statement. In the TypeScript grammar, decorators for methods/functions are not captured within the byte range/node shape used for those node kinds, so codebase_rag/queries/highlights/typescript.scm capture (decorator) @function.decorator is not sufficient for these Function/Method nodes.
    • Fix
      • Adjust the TypeScript extraction path to include the actual decorated parent/wrapper for method/function declarations, or otherwise extend the node/range selection in codebase_rag/parsers/utils.py and/or TypeScript handler logic so decorators preceding TypeScript functions and methods are included in the highlights query range. Add ingestion tests that assert TypeScript class, method, function, and undecorated nodes' exact decorators values.

    T-Rex Ran code and verified through T-Rex

  2. General comment

    P1 Java and PHP decorator lists include duplicate nested captures

    • Bug
      • Head populates Java/PHP decorator fields, but not accurately as list[str]: Java class/method nodes include both full annotations and inner identifiers, e.g. ['@Ann', '@Ann2("cls")', 'Ann', 'Ann2']; PHP class/function/method nodes include both full attribute groups and inner attributes, e.g. ['#[Attr]', 'Attr', "#[Attr('func')]", "Attr('func')"]. This violates the user-visible contract for accurate decorator lists.
    • Cause
      • The highlights queries capture both outer and nested decorator/attribute nodes (java.scm captures both marker_annotation and annotation; php.scm captures both attribute_group and attribute), and extract_modifiers_and_decorators appends every unique captured text without filtering contained captures or choosing a single canonical representation.
    • Fix
      • Make decorator extraction de-duplicate nested captures by span containment or query only canonical outer decorator nodes. For Java, keep full annotation text only; for PHP, choose a stable representation (prefer full #[...] attribute group text if that is the public contract) and suppress contained attribute captures. Add exact-value ingestion tests for Java/PHP Class, Function where applicable, and Method nodes.

    T-Rex Ran code and verified through T-Rex

  3. General comment

    P1 Head records class/function syntax keywords as modifiers instead of empty lists when no modifiers are present

    • Bug
      • In the head ingestion output, nodes with no actual access/language modifiers still receive non-empty modifiers lists containing declaration keywords. The Java plain class has modifiers ["class"], the PHP plain class has modifiers ["class"], and the PHP method declared as function noModFunc() has modifiers ["function"]. The contract under validation says Function, Method, and Class nodes should have modifiers: list[str] with empty lists when absent, generalizing access-modifier extraction. These values are not modifiers and prevent consumers from distinguishing an unmodified declaration from one with real modifiers.
    • Cause
      • The shared highlights.scm-driven extraction path appears to accept broad highlight captures for declaration keywords, not just actual modifier/access-modifier captures, so structural syntax tokens captured from highlights are stored in the node modifiers property.
    • Fix
      • Filter extracted modifier captures to real modifier tokens for each language/node kind, or refine the highlights/local query captures so class and function keywords are not returned as modifiers. Add regression coverage asserting that plain Java/PHP classes and PHP methods without access/static/final modifiers produce modifiers: [].

    T-Rex Ran code and verified through T-Rex

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
codebase_rag/parsers/utils.py:102-106
**Rust attributes are skipped**
For Rust, outer attributes like `#[test]` and `#[derive(Debug)]` are sibling `attribute_item` nodes before the function or class node, not children of it. This range starts at `node.start_byte` and queries only `target_node`, so the shared extractor never sees those siblings. Rust methods and classes now get empty `decorators` where the removed handler walked `prev_named_sibling`.

### Issue 2 of 2
codebase_rag/parser_loader.py:237-255
**Fallback query is unreachable**
The local `queries/highlights/*.scm` fallback is inside the same `try` block after `importlib.import_module(module_name)`. When a grammar is loaded from the vendored submodule path instead of an installed `tree_sitter_*` package, that import fails after the path is removed, so the checked-in highlights query is never read and `highlights` becomes `None`. Modifier and decorator extraction silently disables for that language.

Reviews (1): Last reviewed commit: "fix(parsers): make extract_decorators pr..." | Re-trigger Greptile

Comment thread codebase_rag/parsers/utils.py Outdated
Comment on lines +102 to +106
target_node = node
if node.parent and node.parent.type in ("decorated_definition", "export_statement"):
target_node = node.parent

cursor.set_byte_range(target_node.start_byte, header_end_byte)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Rust attributes are skipped
For Rust, outer attributes like #[test] and #[derive(Debug)] are sibling attribute_item nodes before the function or class node, not children of it. This range starts at node.start_byte and queries only target_node, so the shared extractor never sees those siblings. Rust methods and classes now get empty decorators where the removed handler walked prev_named_sibling.

Artifacts

Repro: focused Rust attribute ingestion script

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: failing parser ingestion output showing empty method decorators

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: codebase_rag/parsers/utils.py
Line: 102-106

Comment:
**Rust attributes are skipped**
For Rust, outer attributes like `#[test]` and `#[derive(Debug)]` are sibling `attribute_item` nodes before the function or class node, not children of it. This range starts at `node.start_byte` and queries only `target_node`, so the shared extractor never sees those siblings. Rust methods and classes now get empty `decorators` where the removed handler walked `prev_named_sibling`.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +237 to +255
module_name = f"{cs.TREE_SITTER_MODULE_PREFIX}{lang_name.replace('-', '_')}"
module = importlib.import_module(module_name)

query_str = ""
if hasattr(module, "HIGHLIGHTS_QUERY"):
query_str = module.HIGHLIGHTS_QUERY

fallback_path = (
Path(__file__).parent / "queries" / "highlights" / f"{lang_name}.scm"
)
if fallback_path.exists():
custom_queries = fallback_path.read_text(encoding="utf-8")
query_str = (
query_str + "\n" + custom_queries if query_str else custom_queries
)

if query_str:
return Query(language, query_str)
except Exception as e:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Fallback query is unreachable
The local queries/highlights/*.scm fallback is inside the same try block after importlib.import_module(module_name). When a grammar is loaded from the vendored submodule path instead of an installed tree_sitter_* package, that import fails after the path is removed, so the checked-in highlights query is never read and highlights becomes None. Modifier and decorator extraction silently disables for that language.

Artifacts

Repro: focused script that simulates unavailable tree_sitter_python while local highlights fallback exists

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: runtime output showing import failure prevents fallback highlights query loading

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: codebase_rag/parser_loader.py
Line: 237-255

Comment:
**Fallback query is unreachable**
The local `queries/highlights/*.scm` fallback is inside the same `try` block after `importlib.import_module(module_name)`. When a grammar is loaded from the vendored submodule path instead of an installed `tree_sitter_*` package, that import fails after the path is removed, so the checked-in highlights query is never read and `highlights` becomes `None`. Modifier and decorator extraction silently disables for that language.

How can I resolve this? If you propose a fix, please make it concise.

…nition keywords, strip decorator args, capture Rust sibling attributes
@ChetanyaRathi ChetanyaRathi changed the title Feat/525 highlights modifiers decorators feat(parsers): extract access modifiers and decorators via highlights.scm Jul 1, 2026
@ChetanyaRathi

Copy link
Copy Markdown
Contributor Author

Heads up: CI didn't run here — all jobs show "The job was not started because your account is locked due to a billing issue," which looks like a repo-level Actions billing problem rather than anything in this PR. Locally the full unit suite passes (only Windows-specific path/symlink/encoding tests fail, which pass on Linux CI), ruff check + format are clean, and ty check codebase_rag is clean. Happy to re-run once Actions is available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

feat: extract access modifiers and decorators via highlights.scm queries

1 participant