Skip to content

Conversation

@tmihalac
Copy link

@tmihalac tmihalac commented Oct 5, 2025

No description provided.

@tmihalac tmihalac requested a review from zvigrinberg October 5, 2025 14:46
@tmihalac tmihalac marked this pull request as ready for review October 5, 2025 14:46
Copy link
Collaborator

@zvigrinberg zvigrinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac
Great Job!, It looks quite good, although there are parts that are not that readable.... ( especially in the java segmenter of methods and java language parser, but for the fairness i'd say it's extermely detailed and covering a lot of cases and edge cases)
Please add 2-4 UT tests functions to protect your logic from regressions, re-use the following 2 functions that i've created for the sake of consistency in checking ( it's UT tests and semi Integration test - checking integration of a given ecosystem with all the module of transitive code search tool

def set_input_for_next_run(git_repository: str, git_ref: str, included_extensions: list[str],
excluded_extensions: list[str]):
source_code_info = [SourceDocumentsInfo(type='code', git_repo=git_repository,
ref=git_ref, include=included_extensions,
exclude=excluded_extensions)]
sbom_info_input = ManualSBOMInfoInput(packages=[SBOMPackage(name="a", version="1.0", system="blabla")])
morpheus_input = AgentMorpheusInput(image=ImageInfoInput(source_info=source_code_info,
sbom_info=sbom_info_input,
analysis_type=AnalysisType.IMAGE),
scan=ScanInfoInput(vulns=[VulnInfo(vuln_id="CVE-2025-1234")]))
engine_input = AgentMorpheusEngineInput(input=morpheus_input, info=AgentMorpheusInfo())
state: AgentMorpheusEngineState = AgentMorpheusEngineState(original_input=engine_input,
code_vdb_path="",
doc_vdb_path="",
code_index_path="",
cve_intel=[])
ctx_state.set(state)
async def get_transitive_code_runner_function():
transitive_code_search = transitive_search(config=TransitiveCodeSearchToolConfig(), builder=None)
async for function in transitive_code_search.gen:
return function.single_fn

And this is an example of go UT that uses these 2 functions:

async def test_transitive_search_golang_4():
transitive_code_search_runner_coroutine = await get_transitive_code_runner_function()
set_input_for_next_run(git_repository="https://github.com/openshift/oc-mirror",
git_ref="b137a53a5360a41a70432ea2bfc98a6cee6f7a4a",
included_extensions=["**/*.go"],
excluded_extensions=["go.mod", "go.sum", "**/*test*.go"])
result = await transitive_code_search_runner_coroutine("github.com/mholt/archiver,Unarchive")
(path_found, list_path) = result
assert path_found is True
assert len(list_path) == 2

Please prepare 2-4 test functions with inputs related to java.
In addition, Please see my comments.

Thanks!.

# Extract
methods = _extract_methods_anywhere(java_source, 0, len(java_source))
# Lambdas in a separate pass across the whole file (so we don't double-count)
lambdas = _extract_lambdas_in_range(java_source, 0, len(java_source))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Eventually you included lambdas in the feature? ( it seems that the segmenter parse them out from sources into documents)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lambdas need more work, currently the parameter types are not derived yet and its not handled yet in the search logic I've opened a new jira for it https://issues.redhat.com/browse/APPENG-3848

@@ -0,0 +1,3671 @@
import os
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac This file is an enormous one and is very hard to maintain. ( not likely to be easily manageable) - almost 4k lines of code.
i think, that all functions that are used internally and methods( should be transformed into functions before that), that are not part of the LanguageParser API should be moved to java_utils module (you already created this file module).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I Would even split all of the util functions between two utils files modules (java_utils and another one)
Maybe according to function' type/kind/how much the code of it is "low level"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree its pretty overwhelming but even splitting it to 2 files will still give 2 files of roughly 2K in addition its not so simple as some of the internal methods relay on other public parser methods which causes cyclic dependencies

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but i barely can write comments next to lines of code in this file because it's so big ( it stuck and takes a lot of time just to mark the portion for the comment), so if i can't even effectively review it and write comments, then it's a problem.
Even more severe - Maintainability is very impacted, and also readability - It's very hard to troubleshoot a potential bug with such a huge file, even with debugger.
try to split it to another 2-3 files, plus this file itself, then it would keep each one of them more manageable and approximately ~1K lines of code each.
About closures ( internal functions) , you should decouple them from the outer methods if possible , but if not possible, because they're using the outer methods, move them together with the outer scope ( the method) outside ( if they're using the outer scope vars), after turning the outer method to function.
the closures ( internal functions) of a method in the class cannot access other methods in that class ( unless it's a static method or class method, which is another story, or that you explicitly expose the instance in the method' scope as a var), they can just access the scope of the containing method and other closures in the same depth ( siblings) and that's it, so an internal/private method that is not part of the LanguageParser API, can be transformed to a function, and moved/migrated to another file , together with all of its closures.

Comment on lines 90 to 105
if not exist:
self.documents_of_functions = [doc for doc in tqdm(self.documents, total=len(self.documents), desc="Filtering documents, leaving functions/methods ...")
if self.language_parser.is_function(doc)]
save_to_cache(PICKLE_CACHE_DIRECTORY, code_source_info.git_repo, code_source_info.ref, self.documents_of_functions ,"documents_of_functions")
else:
self.documents_of_functions = documents_of_functions

documents_of_types, exist = retrieve_from_cache(PICKLE_CACHE_DIRECTORY, code_source_info.git_repo, code_source_info.ref, "documents_of_types")
# filter out full code documents and functions/methods docs, retaining only types/classes docs
if not exist:
self.documents_of_types = [doc for doc in tqdm(self.documents, total=len(self.documents) , desc="Filtering documents, leaving types/classes ...")
if self.language_parser.is_type(doc)]
save_to_cache(PICKLE_CACHE_DIRECTORY, code_source_info.git_repo, code_source_info.ref, self.documents_of_types ,"documents_of_types")
else:
self.documents_of_types = documents_of_types
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Why are you calling save_to_cache twice?
Why is the separation of the types/classes and methods is needed here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i understood correctly, based on the existing logic, you will end up with 3 pickles files, one for all document, one of the java classes and one for the java methods.

But you're anyway getting these from the unified list of documents ( classes + methods).
Is it saving you a lot of time ( does the list comprehension is a costy/time expensive in your test cases?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will end up with a few pickle files, the list comprehensions all together take around 5 - 10 min, each time but its enough for them to be done once and reused each time thats why I save them to cache as seperate files
image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list comprehension takes 5-10 minutes for each kind?
How many documents are in the list can you remind me?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it takes all of them together 5-10 min and its being run multiple time by the transitive search tool

Comment on lines 1 to 10
import os
import re
from dataclasses import dataclass
from typing import Dict, Tuple, Any, List, Optional
from tqdm import tqdm

from langchain_core.documents import Document

from .lang_functions_parsers import LanguageFunctionsParser
from ..java_utils import extract_jar_name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac You should add agent logger and use it across the implementation wherever applicable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't see the need of it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With such a huge file ( even after refactoring and moving out most of the logic, it will remain 500-1k lines of code) i would suggest adding such an agent logger and add some logs across the main logic of it and wherever applicable.

def is_package_imported(self, code_content: str, identifier: str, callee_package: str = "") -> bool:
pass

def extract_method_name_with_params(self, java_src: str) -> str | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Parameter name should be independent of ecosystem

Suggested change
def extract_method_name_with_params(self, java_src: str) -> str | None:
def extract_method_name_with_params(self, src: str) -> str | None:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your right, missed it in refactor

return True
return False

def extract_method_name_with_params(self, java_src: str) -> str | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def extract_method_name_with_params(self, java_src: str) -> str | None:
def extract_method_name_with_params(self, golang_src: str) -> str | None:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your right, missed it in refactor

Comment on lines 100 to 122
@pytest.mark.asyncio
async def test_transitive_search_java_1():
transitive_code_search_runner_coroutine = await get_transitive_code_runner_function()
set_input_for_next_run(git_repository="https://github.com/cryostatio/cryostat",
git_ref="8f753753379e9381429b476aacbf6890ef101438",
included_extensions=["**/*.java"],
excluded_extensions=["target/**/*",
"build/**/*",
"*.class",
".gradle/**/*",
".mvn/**/*",
".gitignore",
"test/**/*",
"tests/**/*",
"src/test/**/*",
"pom.xml",
"build.gradle"])
result = await transitive_code_search_runner_coroutine("commons-beanutils:commons-beanutils:1.9.4,org.apache.commons.beanutils.PropertyUtilsBean.getProperty")
(path_found, list_path) = result
print(result)
assert path_found is False
assert len(list_path) is 1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Great!
If you are having more, try to add at least more 1-2 inputs to protect your logic from future changes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is I don't have more examples, Ronit is working on trying to provide more

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac You don't have to wait for him...
Use other repos, and check for methods that are actually being called, and assert for the path length and for the path_found expression ( the second time), assert the boolean result against true
It Doesn't necessarily has to be vulnerable methods, you're only checking your logic of chain of calls
You can even give the same repo with other reachable method of an artifact ( either transitive or direct one).

@tmihalac tmihalac force-pushed the java-transitive-search branch from e9ee7f8 to 15e1ccf Compare October 29, 2025 17:50
@tmihalac tmihalac requested a review from zvigrinberg October 29, 2025 17:54
@tmihalac tmihalac force-pushed the java-transitive-search branch from f5fa123 to bc89d72 Compare October 30, 2025 10:37
@tmihalac
Copy link
Author

tmihalac commented Oct 30, 2025

/retest

2 similar comments
@tmihalac
Copy link
Author

/retest

@zvigrinberg
Copy link
Collaborator

/retest

@tmihalac tmihalac force-pushed the java-transitive-search branch 4 times, most recently from 8500b8d to 0197565 Compare November 2, 2025 13:10
Copy link
Collaborator

@zvigrinberg zvigrinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Thanks for your good and hard work.
Please see my comments, we need couple of changes ( nothing critical though).
also, after all changes, please run flows in all ecosystems end to end ( we need to do it until Integration tests implementation WIP is done).

Comment on lines 70 to 74
coc_retriever = get_chain_of_calls_retriever(ecosystem,
documents,
git_repo,
query,
code_source_info)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Please use kw to pass the args instead of positional args, because it's very confusing that way.

ecosystem = file.read()
ecosystem = Ecosystem[ecosystem]
coc_retriever = ChainOfCallsRetriever(documents=documents, ecosystem=ecosystem, manifest_path=git_repo)
ecosystem = TransitiveCodeSearcher.get_ecosystem(git_repo)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac The Problem with this is returning the ecosystem even if the download dependencies' install dependencies method is failing ( or getting the tree builder for the "found ecosystem"), and thus in such case it can continue all the way to do the transitive search even without 3rd party libs source codes , which is wrong.
in current logic, it's only save it to the file in case it completes both of the getting the tree builder, and installing the dependencies...
Please revert

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting

Comment on lines 64 to 66
ecosystem: Ecosystem = TransitiveCodeSearcher.get_ecosystem(git_repo_path)
if ecosystem is None:
return False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac The Problem with this is returning the ecosystem even if the download dependencies' install dependencies method is failing ( or getting the tree builder for the "found ecosystem"), and thus in such case it can continue all the way to do the transitive search even without 3rd party libs source codes , which is wrong.
in current logic, it's only save it to the file in case it completes both of the getting the tree builder, and installing the dependencies...
Please revert

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting

Comment on lines 31 to 33
def is_doc_type(self, doc: Document) -> bool:
return doc.page_content.startswith("type")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac This is incorrect, the type in python is class
Why did you change it ?

def get_type_reserved_word(self) -> str:
return "class"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac In addition, you should also instead method

Suggested change
def is_doc_type(self, doc: Document) -> bool:
return doc.page_content.startswith("type")
def is_doc_type(self, doc: Document) -> bool:
return doc.page_content.startswith(self.get_type_reserved_word())

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right

Comment on lines +393 to +406
def is_package_imported(self, code_content: str, identifier: str, callee_package: str = "") -> bool:
for line in code_content.split(os.linesep):
if not self.is_comment_line(line):
if 'import' in line and callee_package in line:
if identifier:
if identifier == callee_package:
if all(word in line for word in ['import', callee_package]):
return True
else:
if all(word in line for word in ['import', callee_package, 'as', identifier]) \
or all(word in line for word in ['from', callee_package, 'import', identifier]):
return True
else:
if all(word in line for word in ['from', callee_package, 'import']):
return True
return False

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac This one that you passing the code str instead of the document, and refactoring of callers were a good decision, it was simplified the readability of the code

i += 1
return n - 1

def _match_balanced_braces(src: str, i: int) -> int:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac please document function purpose succinctly , what it gets as input, what it does, and what is the integer that it returns to callers

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

i -= 1
return 0

def _match_paren_backwards(src: str, i: int) -> int:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Function intention and purpose is not quite clear:
please document function purpose succinctly , what it gets as input, what it does, and what is the integer that it returns to callers

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

i2 += 1
return ""

def _enum_constants_terminator(reg: _TypeRegion) -> int:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Function intention and purpose is not quite clear:
please document function purpose succinctly , what it gets as input, what it does, and what is the integer that it returns to callers

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK


return s

def convert_to_maven_artifact(package_name: str) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac This is only used internally here, please refactor to start with underscore for a better readability and to reflect intention.
If you think that you're going to use it outside of this module in the future, you can leave it as is...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

</tool_identity>
<requires>
Input format: 'package_name,function_name' or 'package_name,class_name.method_name'
Input format: 'package_name,function_name' or 'package_name,class_name.method_name' or maven_gav(groupId:artifactId:version),fully_qualified_class_name.method_name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Did you try out the new package and function locator tool integration with JAVA?
is it working as expected?
This is something that needed to be checked manually, as we still haven't implemented integration tests yet.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I test it, isn't it called all the time ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks to be working correctly, but haven't tried many inputs because I don't have

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi OK.
All i'm asking is if it works as before or as good as it was before, with this new tool description.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For java it looks to be working the same, for go, c, python hard for me to test

@tmihalac tmihalac force-pushed the java-transitive-search branch from 4ede2b8 to 4f62893 Compare November 3, 2025 12:49
Signed-off-by: Theodor Mihalache <[email protected]>
@tmihalac tmihalac force-pushed the java-transitive-search branch from 4f62893 to 5ba4d90 Compare November 9, 2025 13:04
Signed-off-by: Theodor Mihalache <[email protected]>
Signed-off-by: Theodor Mihalache <[email protected]>
@zvigrinberg
Copy link
Collaborator

/retest vulnerability-analysis-on-pr

1 similar comment
@tmihalac
Copy link
Author

tmihalac commented Nov 9, 2025

/retest vulnerability-analysis-on-pr

@tmihalac tmihalac requested a review from zvigrinberg November 18, 2025 14:36
@tmihalac tmihalac force-pushed the java-transitive-search branch from 134dca9 to 6e8f9b3 Compare November 19, 2025 09:27
tmihalac and others added 4 commits November 19, 2025 11:33
Signed-off-by: Theodor Mihalache <[email protected]>
Signed-off-by: Theodor Mihalache <[email protected]>
Signed-off-by: Theodor Mihalache <[email protected]>
…a-transitive-search

# Conflicts:
#	src/vuln_analysis/tools/transitive_code_search.py
@tmihalac
Copy link
Author

/retest vulnerability-analysis-on-pr

Copy link
Collaborator

@zvigrinberg zvigrinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Good Job.
Please Take a look on my comments , most of them are minor ones.

Thanks.

while True:
while i < n and s[i].isspace():
i += 1
if i < n and s[i] == '@':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Please replace @ with a constant JAVA_ANNOTATION_SYMBOL

out.append(x)
return out

def tup_dedup_keep_order(xs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Is it possible to use a more meaningful argument name ? ( what is xs?)
In addition, would be better to document this function.

def simple_name_from_fqcn(fq: str) -> str:
return fq.split('.')[-1] if fq else fq

def dedup_keep_order(xs: List[str]) -> List[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmihalac Is it possible to use a more meaningful argument name ? ( what is xs?)
In addition, would be better to document this function.

@tmihalac tmihalac requested a review from zvigrinberg November 24, 2025 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants