-
Notifications
You must be signed in to change notification settings - Fork 9
Java transitive search #130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: rh-aiq-main
Are you sure you want to change the base?
Java transitive search #130
Conversation
zvigrinberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac
Great Job!, It looks quite good, although there are parts that are not that readable.... ( especially in the java segmenter of methods and java language parser, but for the fairness i'd say it's extermely detailed and covering a lot of cases and edge cases)
Please add 2-4 UT tests functions to protect your logic from regressions, re-use the following 2 functions that i've created for the sake of consistency in checking ( it's UT tests and semi Integration test - checking integration of a given ecosystem with all the module of transitive code search tool
vulnerability-analysis/src/vuln_analysis/tools/test_transitive_code_search.py
Lines 100 to 122 in 3481519
| def set_input_for_next_run(git_repository: str, git_ref: str, included_extensions: list[str], | |
| excluded_extensions: list[str]): | |
| source_code_info = [SourceDocumentsInfo(type='code', git_repo=git_repository, | |
| ref=git_ref, include=included_extensions, | |
| exclude=excluded_extensions)] | |
| sbom_info_input = ManualSBOMInfoInput(packages=[SBOMPackage(name="a", version="1.0", system="blabla")]) | |
| morpheus_input = AgentMorpheusInput(image=ImageInfoInput(source_info=source_code_info, | |
| sbom_info=sbom_info_input, | |
| analysis_type=AnalysisType.IMAGE), | |
| scan=ScanInfoInput(vulns=[VulnInfo(vuln_id="CVE-2025-1234")])) | |
| engine_input = AgentMorpheusEngineInput(input=morpheus_input, info=AgentMorpheusInfo()) | |
| state: AgentMorpheusEngineState = AgentMorpheusEngineState(original_input=engine_input, | |
| code_vdb_path="", | |
| doc_vdb_path="", | |
| code_index_path="", | |
| cve_intel=[]) | |
| ctx_state.set(state) | |
| async def get_transitive_code_runner_function(): | |
| transitive_code_search = transitive_search(config=TransitiveCodeSearchToolConfig(), builder=None) | |
| async for function in transitive_code_search.gen: | |
| return function.single_fn |
And this is an example of go UT that uses these 2 functions:
vulnerability-analysis/src/vuln_analysis/tools/test_transitive_code_search.py
Lines 59 to 69 in 3481519
| async def test_transitive_search_golang_4(): | |
| transitive_code_search_runner_coroutine = await get_transitive_code_runner_function() | |
| set_input_for_next_run(git_repository="https://github.com/openshift/oc-mirror", | |
| git_ref="b137a53a5360a41a70432ea2bfc98a6cee6f7a4a", | |
| included_extensions=["**/*.go"], | |
| excluded_extensions=["go.mod", "go.sum", "**/*test*.go"]) | |
| result = await transitive_code_search_runner_coroutine("github.com/mholt/archiver,Unarchive") | |
| (path_found, list_path) = result | |
| assert path_found is True | |
| assert len(list_path) == 2 |
Please prepare 2-4 test functions with inputs related to java.
In addition, Please see my comments.
Thanks!.
| # Extract | ||
| methods = _extract_methods_anywhere(java_source, 0, len(java_source)) | ||
| # Lambdas in a separate pass across the whole file (so we don't double-count) | ||
| lambdas = _extract_lambdas_in_range(java_source, 0, len(java_source)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Eventually you included lambdas in the feature? ( it seems that the segmenter parse them out from sources into documents)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lambdas need more work, currently the parameter types are not derived yet and its not handled yet in the search logic I've opened a new jira for it https://issues.redhat.com/browse/APPENG-3848
| @@ -0,0 +1,3671 @@ | |||
| import os | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac This file is an enormous one and is very hard to maintain. ( not likely to be easily manageable) - almost 4k lines of code.
i think, that all functions that are used internally and methods( should be transformed into functions before that), that are not part of the LanguageParser API should be moved to java_utils module (you already created this file module).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I Would even split all of the util functions between two utils files modules (java_utils and another one)
Maybe according to function' type/kind/how much the code of it is "low level"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree its pretty overwhelming but even splitting it to 2 files will still give 2 files of roughly 2K in addition its not so simple as some of the internal methods relay on other public parser methods which causes cyclic dependencies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but i barely can write comments next to lines of code in this file because it's so big ( it stuck and takes a lot of time just to mark the portion for the comment), so if i can't even effectively review it and write comments, then it's a problem.
Even more severe - Maintainability is very impacted, and also readability - It's very hard to troubleshoot a potential bug with such a huge file, even with debugger.
try to split it to another 2-3 files, plus this file itself, then it would keep each one of them more manageable and approximately ~1K lines of code each.
About closures ( internal functions) , you should decouple them from the outer methods if possible , but if not possible, because they're using the outer methods, move them together with the outer scope ( the method) outside ( if they're using the outer scope vars), after turning the outer method to function.
the closures ( internal functions) of a method in the class cannot access other methods in that class ( unless it's a static method or class method, which is another story, or that you explicitly expose the instance in the method' scope as a var), they can just access the scope of the containing method and other closures in the same depth ( siblings) and that's it, so an internal/private method that is not part of the LanguageParser API, can be transformed to a function, and moved/migrated to another file , together with all of its closures.
| if not exist: | ||
| self.documents_of_functions = [doc for doc in tqdm(self.documents, total=len(self.documents), desc="Filtering documents, leaving functions/methods ...") | ||
| if self.language_parser.is_function(doc)] | ||
| save_to_cache(PICKLE_CACHE_DIRECTORY, code_source_info.git_repo, code_source_info.ref, self.documents_of_functions ,"documents_of_functions") | ||
| else: | ||
| self.documents_of_functions = documents_of_functions | ||
|
|
||
| documents_of_types, exist = retrieve_from_cache(PICKLE_CACHE_DIRECTORY, code_source_info.git_repo, code_source_info.ref, "documents_of_types") | ||
| # filter out full code documents and functions/methods docs, retaining only types/classes docs | ||
| if not exist: | ||
| self.documents_of_types = [doc for doc in tqdm(self.documents, total=len(self.documents) , desc="Filtering documents, leaving types/classes ...") | ||
| if self.language_parser.is_type(doc)] | ||
| save_to_cache(PICKLE_CACHE_DIRECTORY, code_source_info.git_repo, code_source_info.ref, self.documents_of_types ,"documents_of_types") | ||
| else: | ||
| self.documents_of_types = documents_of_types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Why are you calling save_to_cache twice?
Why is the separation of the types/classes and methods is needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If i understood correctly, based on the existing logic, you will end up with 3 pickles files, one for all document, one of the java classes and one for the java methods.
But you're anyway getting these from the unified list of documents ( classes + methods).
Is it saving you a lot of time ( does the list comprehension is a costy/time expensive in your test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list comprehension takes 5-10 minutes for each kind?
How many documents are in the list can you remind me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it takes all of them together 5-10 min and its being run multiple time by the transitive search tool
| import os | ||
| import re | ||
| from dataclasses import dataclass | ||
| from typing import Dict, Tuple, Any, List, Optional | ||
| from tqdm import tqdm | ||
|
|
||
| from langchain_core.documents import Document | ||
|
|
||
| from .lang_functions_parsers import LanguageFunctionsParser | ||
| from ..java_utils import extract_jar_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac You should add agent logger and use it across the implementation wherever applicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't see the need of it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With such a huge file ( even after refactoring and moving out most of the logic, it will remain 500-1k lines of code) i would suggest adding such an agent logger and add some logs across the main logic of it and wherever applicable.
| def is_package_imported(self, code_content: str, identifier: str, callee_package: str = "") -> bool: | ||
| pass | ||
|
|
||
| def extract_method_name_with_params(self, java_src: str) -> str | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Parameter name should be independent of ecosystem
| def extract_method_name_with_params(self, java_src: str) -> str | None: | |
| def extract_method_name_with_params(self, src: str) -> str | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your right, missed it in refactor
| return True | ||
| return False | ||
|
|
||
| def extract_method_name_with_params(self, java_src: str) -> str | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def extract_method_name_with_params(self, java_src: str) -> str | None: | |
| def extract_method_name_with_params(self, golang_src: str) -> str | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your right, missed it in refactor
| @pytest.mark.asyncio | ||
| async def test_transitive_search_java_1(): | ||
| transitive_code_search_runner_coroutine = await get_transitive_code_runner_function() | ||
| set_input_for_next_run(git_repository="https://github.com/cryostatio/cryostat", | ||
| git_ref="8f753753379e9381429b476aacbf6890ef101438", | ||
| included_extensions=["**/*.java"], | ||
| excluded_extensions=["target/**/*", | ||
| "build/**/*", | ||
| "*.class", | ||
| ".gradle/**/*", | ||
| ".mvn/**/*", | ||
| ".gitignore", | ||
| "test/**/*", | ||
| "tests/**/*", | ||
| "src/test/**/*", | ||
| "pom.xml", | ||
| "build.gradle"]) | ||
| result = await transitive_code_search_runner_coroutine("commons-beanutils:commons-beanutils:1.9.4,org.apache.commons.beanutils.PropertyUtilsBean.getProperty") | ||
| (path_found, list_path) = result | ||
| print(result) | ||
| assert path_found is False | ||
| assert len(list_path) is 1 | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Great!
If you are having more, try to add at least more 1-2 inputs to protect your logic from future changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is I don't have more examples, Ronit is working on trying to provide more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac You don't have to wait for him...
Use other repos, and check for methods that are actually being called, and assert for the path length and for the path_found expression ( the second time), assert the boolean result against true
It Doesn't necessarily has to be vulnerable methods, you're only checking your logic of chain of calls
You can even give the same repo with other reachable method of an artifact ( either transitive or direct one).
e9ee7f8 to
15e1ccf
Compare
f5fa123 to
bc89d72
Compare
|
/retest |
2 similar comments
|
/retest |
|
/retest |
8500b8d to
0197565
Compare
zvigrinberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Thanks for your good and hard work.
Please see my comments, we need couple of changes ( nothing critical though).
also, after all changes, please run flows in all ecosystems end to end ( we need to do it until Integration tests implementation WIP is done).
| coc_retriever = get_chain_of_calls_retriever(ecosystem, | ||
| documents, | ||
| git_repo, | ||
| query, | ||
| code_source_info) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Please use kw to pass the args instead of positional args, because it's very confusing that way.
| ecosystem = file.read() | ||
| ecosystem = Ecosystem[ecosystem] | ||
| coc_retriever = ChainOfCallsRetriever(documents=documents, ecosystem=ecosystem, manifest_path=git_repo) | ||
| ecosystem = TransitiveCodeSearcher.get_ecosystem(git_repo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac The Problem with this is returning the ecosystem even if the download dependencies' install dependencies method is failing ( or getting the tree builder for the "found ecosystem"), and thus in such case it can continue all the way to do the transitive search even without 3rd party libs source codes , which is wrong.
in current logic, it's only save it to the file in case it completes both of the getting the tree builder, and installing the dependencies...
Please revert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverting
| ecosystem: Ecosystem = TransitiveCodeSearcher.get_ecosystem(git_repo_path) | ||
| if ecosystem is None: | ||
| return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac The Problem with this is returning the ecosystem even if the download dependencies' install dependencies method is failing ( or getting the tree builder for the "found ecosystem"), and thus in such case it can continue all the way to do the transitive search even without 3rd party libs source codes , which is wrong.
in current logic, it's only save it to the file in case it completes both of the getting the tree builder, and installing the dependencies...
Please revert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverting
| def is_doc_type(self, doc: Document) -> bool: | ||
| return doc.page_content.startswith("type") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac This is incorrect, the type in python is class
Why did you change it ?
vulnerability-analysis/src/vuln_analysis/utils/functions_parsers/python_functions_parser.py
Lines 99 to 100 in 0197565
| def get_type_reserved_word(self) -> str: | |
| return "class" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac In addition, you should also instead method
| def is_doc_type(self, doc: Document) -> bool: | |
| return doc.page_content.startswith("type") | |
| def is_doc_type(self, doc: Document) -> bool: | |
| return doc.page_content.startswith(self.get_type_reserved_word()) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right
| def is_package_imported(self, code_content: str, identifier: str, callee_package: str = "") -> bool: | ||
| for line in code_content.split(os.linesep): | ||
| if not self.is_comment_line(line): | ||
| if 'import' in line and callee_package in line: | ||
| if identifier: | ||
| if identifier == callee_package: | ||
| if all(word in line for word in ['import', callee_package]): | ||
| return True | ||
| else: | ||
| if all(word in line for word in ['import', callee_package, 'as', identifier]) \ | ||
| or all(word in line for word in ['from', callee_package, 'import', identifier]): | ||
| return True | ||
| else: | ||
| if all(word in line for word in ['from', callee_package, 'import']): | ||
| return True | ||
| return False | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac This one that you passing the code str instead of the document, and refactoring of callers were a good decision, it was simplified the readability of the code
| i += 1 | ||
| return n - 1 | ||
|
|
||
| def _match_balanced_braces(src: str, i: int) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac please document function purpose succinctly , what it gets as input, what it does, and what is the integer that it returns to callers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
| i -= 1 | ||
| return 0 | ||
|
|
||
| def _match_paren_backwards(src: str, i: int) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Function intention and purpose is not quite clear:
please document function purpose succinctly , what it gets as input, what it does, and what is the integer that it returns to callers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
| i2 += 1 | ||
| return "" | ||
|
|
||
| def _enum_constants_terminator(reg: _TypeRegion) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Function intention and purpose is not quite clear:
please document function purpose succinctly , what it gets as input, what it does, and what is the integer that it returns to callers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
|
|
||
| return s | ||
|
|
||
| def convert_to_maven_artifact(package_name: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac This is only used internally here, please refactor to start with underscore for a better readability and to reflect intention.
If you think that you're going to use it outside of this module in the future, you can leave it as is...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
| </tool_identity> | ||
| <requires> | ||
| Input format: 'package_name,function_name' or 'package_name,class_name.method_name' | ||
| Input format: 'package_name,function_name' or 'package_name,class_name.method_name' or maven_gav(groupId:artifactId:version),fully_qualified_class_name.method_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Did you try out the new package and function locator tool integration with JAVA?
is it working as expected?
This is something that needed to be checked manually, as we still haven't implemented integration tests yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do I test it, isn't it called all the time ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks to be working correctly, but haven't tried many inputs because I don't have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi OK.
All i'm asking is if it works as before or as good as it was before, with this new tool description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For java it looks to be working the same, for go, c, python hard for me to test
4ede2b8 to
4f62893
Compare
Signed-off-by: Theodor Mihalache <[email protected]>
4f62893 to
5ba4d90
Compare
Signed-off-by: Theodor Mihalache <[email protected]>
Signed-off-by: Theodor Mihalache <[email protected]>
|
/retest vulnerability-analysis-on-pr |
1 similar comment
|
/retest vulnerability-analysis-on-pr |
134dca9 to
6e8f9b3
Compare
Signed-off-by: Theodor Mihalache <[email protected]>
Signed-off-by: Theodor Mihalache <[email protected]>
Signed-off-by: Theodor Mihalache <[email protected]>
…a-transitive-search # Conflicts: # src/vuln_analysis/tools/transitive_code_search.py
|
/retest vulnerability-analysis-on-pr |
zvigrinberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Good Job.
Please Take a look on my comments , most of them are minor ones.
Thanks.
src/vuln_analysis/utils/functions_parsers/java_functions_parsers.py
Outdated
Show resolved
Hide resolved
src/vuln_analysis/utils/functions_parsers/java_functions_parsers.py
Outdated
Show resolved
Hide resolved
| while True: | ||
| while i < n and s[i].isspace(): | ||
| i += 1 | ||
| if i < n and s[i] == '@': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Please replace @ with a constant JAVA_ANNOTATION_SYMBOL
| out.append(x) | ||
| return out | ||
|
|
||
| def tup_dedup_keep_order(xs: List[Tuple[str, str]]) -> List[Tuple[str, str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Is it possible to use a more meaningful argument name ? ( what is xs?)
In addition, would be better to document this function.
| def simple_name_from_fqcn(fq: str) -> str: | ||
| return fq.split('.')[-1] if fq else fq | ||
|
|
||
| def dedup_keep_order(xs: List[str]) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmihalac Is it possible to use a more meaningful argument name ? ( what is xs?)
In addition, would be better to document this function.
Signed-off-by: Theodor Mihalache <[email protected]>
Signed-off-by: Theodor Mihalache <[email protected]>
Signed-off-by: Theodor Mihalache <[email protected]>
…added docstring Signed-off-by: Theodor Mihalache <[email protected]>
…d docstring Signed-off-by: Theodor Mihalache <[email protected]>
Signed-off-by: Theodor Mihalache <[email protected]>
…st instead of set Signed-off-by: Theodor Mihalache <[email protected]>
Signed-off-by: Theodor Mihalache <[email protected]>

No description provided.