Author reconstruction for 'GitHub' user #53
This allows for reconstruction of correct commit author if user is github Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also added one comment for clarity Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
This PR should also be ready for review. I was unable to test my changes in Python 2, but since we are porting to Python 3 in the near future, I hope this is not an issue. I also have the tested Python 3 code locally, which could make @maxloeffler's work a bit easier.
Pull Request Overview
This PR addresses issue #52 by implementing author reconstruction for events triggered by the 'GitHub' user. The main purpose is to capture and preserve commit author information in event data so that the actual commit author can be reconstructed during post-processing, even when the event appears to be triggered by the generic 'GitHub' user.
- Adds commit author information to `event_info_2` field for `commit_added` events
- Implements connected events filtering and matching logic to handle related issue connections
- Updates author post-processing to use the preserved author information as a fallback when commit data is unavailable
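The fallback described in the overview can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the event dictionary layout, the `commit_authors` mapping, and the function name `reconstruct_author` are all assumptions made for this example. Only the `event_info_2` field and the 'GitHub' user name come from the discussion.

```python
GITHUB_USER = "GitHub"  # generic user name mentioned in the PR


def reconstruct_author(event, commit_authors):
    """Return the best-known author for a commit-related event.

    `event` is a hypothetical dict with the keys 'author', 'commit_hash'
    and 'event_info_2' (the preserved commit author name).
    `commit_authors` is a hypothetical mapping from commit hash to author.
    """
    # Events triggered by a regular user need no reconstruction.
    if event["author"] != GITHUB_USER:
        return event["author"]

    # Preferred source: look the commit up by its hash.
    by_hash = commit_authors.get(event.get("commit_hash"))
    if by_hash:
        return by_hash

    # Fallback: the author name preserved in event_info_2.
    preserved = event.get("event_info_2") or ""
    return preserved if preserved else event["author"]
```

With this shape, an event triggered by 'GitHub' whose commit hash is unknown still resolves to the preserved name, and it stays attributed to 'GitHub' only when neither source is available.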
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| issue_processing/issue_processing.py | Adds author preservation logic, connected events handling, and updates user processing to include commit author information |
| author_postprocessing/author_postprocessing.py | Implements fallback author reconstruction using preserved author names when commit hash lookup fails |
bockthom
left a comment
Could you please address the following issues, as well as the issues previously pointed out by Copilot, as well as the other unresolved one regarding necessary comments from a previous review of mine?
Thank you!
|
And the copyright header in |
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
also save merge commits
reconstruction of connected events is done by first saving all connected events that occurred at the same time. Then, it is possible to match connected events iff:
- half of the involved issues are equal, meaning that one issue is connected to multiple others
- half (rounded up) of the involved issues are equal, meaning that we have one external connected event and then the previous case with the remaining issues
Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
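One possible reading of the matching condition in this commit message is sketched below. It assumes that each connected event is a `(timestamp, issue_number)` pair and that "half of the involved issues are equal" means the most frequent issue number accounts for half of the events in a group (rounded up when the group size is odd). The tuple layout and the function names are hypothetical and are not taken from the PR.

```python
from collections import Counter, defaultdict


def group_by_timestamp(events):
    """Group hypothetical (timestamp, issue_number) pairs by timestamp."""
    groups = defaultdict(list)
    for timestamp, issue_number in events:
        groups[timestamp].append(issue_number)
    return dict(groups)


def is_matchable(issue_numbers):
    """Check whether one issue number makes up half (rounded up) of a group.

    Even group size: one issue is connected to multiple others.
    Odd group size: the same case plus one external connected event.
    """
    if not issue_numbers:
        return False
    total = len(issue_numbers)
    most_frequent = Counter(issue_numbers).most_common(1)[0][1]
    # (total + 1) // 2 is the rounded-up half, computed with integers only.
    return most_frequent == (total + 1) // 2
```

For example, if issue 1 is connected to issues 2, 3 and 4 at the same time, the group contains issue 1 three times and the others once each, so three of six entries are equal. One additional external event on issue 1 yields four of seven.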
since data is modified in-place, return of input data is not needed Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Also add commit hash if closed by commit Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also rename 'new feature' to 'feature' Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also remove duplicates from type list Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
using empty line reserved for jira components Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also added copyright header Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also minor fixes and removal of math.ceil Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
comments now each have a boolean field that describes whether the comment contains a suggestion or not Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
dicts for reconstructing connected events are now better explained and the comments do not disrupt the workflow in the run function anymore Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
includes: - updated comments - spelling mistake - fix for potential crash if script is used on old data Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
author postprocessing now also contains a list of known copilot user names that can be extended to unify more different copilot users Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
the events 'copilot_work_started' and 'copilot_work_finished' now always have the standard copilot user data Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Method doc updated to reflect new functionality Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
previously, the creator of the issues was falsely matched to the connected event instead of the user triggering the event Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
One more issue: The anonymization script cannot handle our manually introduced Copilot user as it is not part of the authors.list file. Solution: Add Copilot to the authors.list if it is not present there (during author_postprocessing). However, we might also think about whether we ignore it in anonymization. Please check whether we ignore the GitHub user in anonymization or not.
unification now done on all files, which should prevent any issues arising from unknown authors during anonymization also move all global variables to a new utils file Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Known agents such as 'copilot' or 'claude' can now be read, similar to known bots. They will be flagged as agents during bot processing. Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Add a helper function for creating bot name variants utilizing either '[bot]' or 'bot' suffix. Also update bot processing to check user buffer for all variants. Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Add a helper function that, given a bot name and a list of names, returns which bot name variant is contained in the list (or None). This is used whenever we check if a known bot is in the user data or has been predicted to be a bot, and means that bot names in the known_bots file do not need to be duplicated for each variant. Also, automatically add all known Copilot users to the known_agents list, and then unify those during author postprocessing. Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
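The two helpers described in these commit messages might look roughly like the sketch below. The function names, the suffix tuple, and the decision to include the bare name among the variants are assumptions made for illustration. The actual helpers in the PR may differ.

```python
BOT_SUFFIXES = ("[bot]", "bot")  # suffixes named in the commit message


def bot_name_variants(bot_name):
    """Return the bare name followed by its '[bot]' and 'bot' variants."""
    return [bot_name] + [bot_name + suffix for suffix in BOT_SUFFIXES]


def find_bot_variant(bot_name, names):
    """Return the first variant of `bot_name` found in `names`, else None."""
    known = set(names)  # set lookup keeps each membership check cheap
    for variant in bot_name_variants(bot_name):
        if variant in known:
            return variant
    return None
```

Checking against the generated variants means a known bot only has to be listed once, regardless of which suffix a given data source uses.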
I found some time to run the issue analysis on our favorite test project.
This way, I encountered a number of issues that still need to be fixed. I guess the majority of them are easy to fix - only one or two might need a little more thinking on how to fix it. Please see my detailed comments below.
also add agents to bot handling, fix formatting for event_info_2 and subissues also fix a typo where strings would not have their quotes correctly removed Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
This PR looks good to me now, so far. I ran it on one project for testing reasons and did not encounter any unexpected behavior anymore. 🥳 Let's wait for the last tiny little piece (the reason for locking), and then this PR can be merged.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
Comments suppressed due to low confidence (1)
issue_processing/issue_processing.py:603
`reformat_events` now mutates `issue_data` in-place and returns `None`, but its docstring still says it returns updated issue data. Please update the docstring (or return the list) to keep the API self-documenting.
def reformat_events(issue_data, filtered_connected_events, external_connected_events):
    """
    Re-format event information dependent on the event type.
    :param issue_data: the data of all issues that shall be re-formatted
    :return: the issue data with updated event information
    """
lock reason is saved in event_info_1 Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
docstrings should now more accurately reflect parameters and return values Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
For consistency with github events Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
bot_processing/bot_processing.py:137
`predicted_bots` is documented/used as “usernames of the bots predicted to be a bot”, but the current code collects usernames from all rows in `bot_data` (which includes humans too per `load_bot_data`). This can prevent known bots/agents from being added because they will appear as “predicted” even if predicted as Human. Filter `predicted_bots` by the prediction column (e.g., only rows with label `Bot`/`Agent`) before doing membership checks.
# Get the GitHub usernames of the bots predicted to be a bot
predicted_bots = [bot[0] if len(bot) > 0 else "" for bot in bot_data]
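The filtering suggested in this review comment could be sketched as follows. The position of the prediction column (`PREDICTION_COLUMN`) and the helper name are assumptions, since the layout of `bot_data` rows is not shown here. The label values `Bot` and `Agent` are taken from the comment itself.

```python
BOT_LABELS = {"Bot", "Agent"}  # labels mentioned in the review comment
PREDICTION_COLUMN = 1          # hypothetical index of the prediction label


def predicted_bot_names(bot_data):
    """Collect usernames only from rows whose prediction is Bot or Agent."""
    return [
        row[0]
        for row in bot_data
        # Skip rows too short to carry a prediction label (e.g. empty rows).
        if len(row) > PREDICTION_COLUMN and row[PREDICTION_COLUMN] in BOT_LABELS
    ]
```

With this filter, a user predicted as Human no longer counts as "predicted", so a known bot or agent with that name can still be added afterwards.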
previously removed event_info_2 for state_updated event, leading to crashes of the issue processing. Now, it instead contains an empty string. Also fix a minor spelling mistake Signed-off-by: <s8lesend@stud.uni-saarland.de>
This PR addresses issue #52.
It fixes this issue by adding the correct author to the field 'event_info_2' in the issue data. This allows for reconstruction of the commit's author during author postprocessing if the user that invoked the event is the 'GitHub' user.
Since parts of the changes take place in functions relying on codeface, testing all the changes is currently not possible.