
Author reconstruction for 'GitHub' user#53

Open
Leo-Send wants to merge 26 commits into se-sic:master from Leo-Send:master


@Leo-Send Leo-Send commented Aug 25, 2025

This PR addresses issue #52.

It fixes the issue by adding the correct author to the field 'event_info_2' in the issue data. This allows the commit's author to be reconstructed during author postprocessing if the user that invoked the event is the 'GitHub' user.

Since parts of the changes take place in functions relying on codeface, testing all the changes is currently not possible.

This allows for reconstruction of correct commit author if user is
github

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
@Leo-Send Leo-Send changed the title Add commit author of 'commit_added' events to event info Author reconstruction for 'GitHub' user Aug 25, 2025
Collaborator

@bockthom bockthom left a comment


Thanks for this PR @Leo-Send.
I haven't run your changes yet, but here are two comments already:

Could you please update your copyright header in the files you have changed?

also added one comment for clarity

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
@Leo-Send
Author

This PR should also be ready for review. I was unable to test my changes in Python 2, but since we are porting to Python 3 in the near future, I hope this is not an issue. I also have the tested Python 3 code locally, which could make @maxloeffler's work a bit easier.

@bockthom bockthom requested a review from Copilot September 30, 2025 05:49

Copilot AI left a comment


Pull Request Overview

This PR addresses issue #52 by implementing author reconstruction for events triggered by the 'GitHub' user. The main purpose is to capture and preserve commit author information in event data so that the actual commit author can be reconstructed during post-processing, even when the event appears to be triggered by the generic 'GitHub' user.

  • Adds commit author information to event_info_2 field for commit_added events
  • Implements connected events filtering and matching logic to handle related issue connections
  • Updates author post-processing to use the preserved author information as a fallback when commit data is unavailable
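The fallback described in the last bullet can be sketched as follows. This is a minimal illustration, not the PR's code: the dict-based event layout, the helper name `reconstruct_author`, the `commit_authors` lookup, and the assumption that `event_info_1` carries the commit hash are all hypothetical. Only the 'GitHub' user, the `commit_added` event, and the `event_info_2` field come from the discussion.

```python
GITHUB_USER = "GitHub"


def reconstruct_author(event, commit_authors):
    """Return the author name to use for an issue event.

    `event` is a hypothetical dict with the keys 'event', 'author',
    'event_info_1' (assumed: commit hash) and 'event_info_2' (preserved
    commit author name). `commit_authors` maps commit hashes to names.
    """
    # Only 'commit_added' events attributed to the generic user need fixing.
    if event["event"] != "commit_added" or event["author"] != GITHUB_USER:
        return event["author"]

    # Preferred source: the commit data itself.
    commit_hash = event.get("event_info_1", "")
    if commit_hash in commit_authors:
        return commit_authors[commit_hash]

    # Fallback: the author name preserved during issue processing.
    preserved = event.get("event_info_2", "")
    return preserved if preserved else event["author"]
```

If neither the commit lookup nor the preserved name is available, the sketch keeps the 'GitHub' user rather than guessing.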

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File | Description
issue_processing/issue_processing.py | Adds author preservation logic, connected events handling, and updates user processing to include commit author information
author_postprocessing/author_postprocessing.py | Implements fallback author reconstruction using preserved author names when commit hash lookup fails


Collaborator

@bockthom bockthom left a comment


Could you please address the following issues, the issues previously pointed out by Copilot, and the other unresolved one from a previous review of mine regarding necessary comments?

Thank you!

@bockthom bockthom requested a review from Copilot October 21, 2025 01:24
@bockthom
Collaborator

And the copyright header in issue_processing/jira_issue_processing.py needs to be updated as well.


Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.



also save merge commits
reconstruction of connected events is done by first saving all connected
events that occurred at the same time. Then, it is possible to match
connected events iff:
- half of the involved issues are equal, meaning that one issue is
  connected to multiple others
- half, rounded up, of the involved issues are equal, meaning that we have
  one external connected event and then the previous case with the
  remaining issues

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
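One possible reading of the matching condition in the commit message above, as a small sketch. The function name `find_hub_issue` and the input (a flat list of the issue numbers behind all connected events that share one timestamp) are assumptions made for illustration; the PR's actual data structures are not shown in this thread. Integer arithmetic replaces `math.ceil` for the "half, rounded up" case.

```python
from collections import Counter


def find_hub_issue(issue_numbers):
    """Return the issue that is connected to all the others, or None.

    `issue_numbers` holds one entry per connected event that occurred
    at the same time (hypothetical input format).
    """
    total = len(issue_numbers)
    if total < 2:
        return None

    issue, count = Counter(issue_numbers).most_common(1)[0]

    # Even total: exactly half of the entries must be the same issue.
    # Odd total: half rounded up must be the same issue, the extra entry
    # being the one external connected event.
    if count == (total + 1) // 2:
        return issue
    return None
```

For example, `find_hub_issue([7, 3, 7, 5, 7, 9])` yields `7`, and so does `find_hub_issue([7, 7, 3, 7, 5])`, where one of the events for issue 7 would be the external one. A tie such as `[1, 2]` is resolved in favor of the first issue seen, which a real implementation may want to handle differently.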
since data is modified in-place, return of input data is not needed

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Also add commit hash if closed by commit

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also rename 'new feature' to 'feature'

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also remove duplicates from type list

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
using empty line reserved for jira components

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also added copyright header

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
also minor fixes and removal of math.ceil

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
comments now each have a boolean field that describes whether the
comment contains a suggestion or not

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
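A minimal sketch of how such a flag could be computed, assuming that a suggestion is recognized by GitHub's suggested-change syntax (a fenced block whose info string is `suggestion`). The field name `contains_suggestion`, the `body` key, and the dict-based comment layout are hypothetical; the PR may detect suggestions differently.

```python
# Opening fence of a GitHub suggested change, built programmatically so that
# this example does not itself contain a nested code fence.
SUGGESTION_FENCE = "`" * 3 + "suggestion"


def contains_suggestion(comment_body):
    """Return True if any line of the comment opens a suggestion block."""
    return any(
        line.lstrip().startswith(SUGGESTION_FENCE)
        for line in comment_body.splitlines()
    )


def annotate_comments(comments):
    """Add a boolean 'contains_suggestion' field to every comment in-place."""
    for comment in comments:
        comment["contains_suggestion"] = contains_suggestion(comment.get("body", ""))
```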
dicts for reconstructing connected events are now better explained and
the comments do not disrupt the workflow in the run function anymore

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
@bockthom bockthom requested a review from Copilot November 4, 2025 08:03

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.



includes:
- updated comments
- spelling mistake
- fix for potential crash if script is used on old data

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
author postprocessing now also contains a list of known copilot user
names that can be extended to unify more different copilot users

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
the events 'copilot_work_started' and 'copilot_work_finished' now always
have the standard copilot user data

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
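The two commits above could be sketched roughly like this. The name/email pair for the standard Copilot user matches the one that appears in the traceback later in this thread; the entries in `KNOWN_COPILOT_NAMES` and the function name `unify_copilot_user` are illustrative assumptions, not the PR's actual list or API.

```python
# Standard Copilot user data (pair taken from the traceback in this thread).
COPILOT_USER = ("Copilot", "copilot@example.com")

# Hypothetical, extensible list of user names that all denote Copilot.
KNOWN_COPILOT_NAMES = {"copilot", "copilot-swe-agent"}

# Events that always get the standard Copilot user data.
COPILOT_EVENTS = {"copilot_work_started", "copilot_work_finished"}


def unify_copilot_user(name, email, event_type=None):
    """Map every known Copilot variant to the standard Copilot user."""
    if event_type in COPILOT_EVENTS or name.lower() in KNOWN_COPILOT_NAMES:
        return COPILOT_USER
    return (name, email)
```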
Method doc updated to reflect new functionality

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
previously, the creator of the issues was falsely matched to the
connected event instead of the user triggering the event

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
@bockthom
Collaborator

bockthom commented Feb 2, 2026

One more issue:

The anonymization script cannot handle our manually introduced Copilot user as it is not part of the authors.list file:

File "anonymization/anonymization.py", line 236, in run_anonymization
    new_author = author_to_anonymized_author[(issue_event[9], issue_event[10])]
KeyError: ('Copilot', 'copilot@example.com')

Solution: Add Copilot to the authors.list if it is not present there (during author_postprocessing).

However, we might also think about whether we ignore it in anonymization. Please check whether we ignore the GitHub user in anonymization or not.
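The proposed solution can be illustrated with a small sketch. It works on an in-memory list of `(name, email)` tuples because the on-disk format of authors.list is not shown here. The helper names and the placeholder scheme in `build_anonymization_map` are hypothetical and only serve to show why the `KeyError` disappears once the Copilot user is part of the author list.

```python
COPILOT_USER = ("Copilot", "copilot@example.com")


def ensure_author(authors, author):
    """Append `author` to the list in-place if it is not present yet."""
    if author not in authors:
        authors.append(author)


def build_anonymization_map(authors):
    """Map each (name, email) pair to a placeholder identity (made-up scheme)."""
    return {
        author: ("developer{}".format(i), "mail{}@dev.org".format(i))
        for i, author in enumerate(authors, start=1)
    }


authors = [("Alice", "alice@example.org")]
ensure_author(authors, COPILOT_USER)
author_to_anonymized_author = build_anonymization_map(authors)

# The lookup from the traceback now succeeds instead of raising KeyError.
new_author = author_to_anonymized_author[COPILOT_USER]
```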

unification now done on all files, which should prevent any issues
arising from unknown authors during anonymization
also move all global variables to a new utils file

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Known agents such as 'copilot' or 'claude' can now be read, similar to
known bots. They will be flagged as agents during bot processing.

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Add a helper function for creating bot name variants utilizing either
'[bot]' or 'bot' suffix. Also update bot processing to check user buffer
for all variants.

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
Add a helper function that given a botname and a list of names, returns
which bot name variant is contained in the list (or None). This is used
whenever we check if a known bot is in the userdata or has been
predicted to be a bot, and means that botnames in the known_bots file do
not need to be duplicated for each variant.
Also, automatically add all known copilot users to the known_agents
list, and then unify those during author postprocessing.

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
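The two helpers described in the commits above might look roughly like the following. Which variants the PR actually generates is not shown in this thread, so this sketch assumes three of them: the bare name, the name with a '[bot]' suffix, and the name with a 'bot' suffix. The function names are hypothetical.

```python
def bot_name_variants(bot_name):
    """Return the assumed spelling variants of a bot name."""
    # Normalize by stripping a trailing '[bot]' suffix, if there is one.
    base = bot_name[:-len("[bot]")] if bot_name.endswith("[bot]") else bot_name
    return [base, base + "[bot]", base + "bot"]


def find_bot_variant(bot_name, names):
    """Return the variant of `bot_name` contained in `names`, or None."""
    for variant in bot_name_variants(bot_name):
        if variant in names:
            return variant
    return None
```

With helpers like these, an entry such as `dependabot` in the known_bots file also matches a user named `dependabot[bot]`, so the file needs only one entry per bot.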
Collaborator

@bockthom bockthom left a comment


I found some time to run the issue analysis on our favorite test project.
This way, I encountered a number of issues that still need to be fixed. I guess the majority of them are easy to fix - only one or two might need a little more thinking on how to fix it. Please see my detailed comments below.

also add agents to bot handling, fix formatting for event_info_2 and
subissues
also fix a typo where strings would not have their quotes correctly
removed

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
@bockthom
Collaborator

bockthom commented Mar 4, 2026

This PR looks good to me now, so far. I ran it on one project for testing reasons and did not encounter any unexpected behavior any more. 🥳

Let's wait for the last tiny little piece (the reason for locking), and then this PR can be merged.
Meanwhile, as we are close to merging already, I will ask @copilot to review this PR for us.


Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (1)

issue_processing/issue_processing.py:603

  • reformat_events now mutates issue_data in-place and returns None, but its docstring still says it returns updated issue data. Please update the docstring (or return the list) to keep the API self-documenting.
def reformat_events(issue_data, filtered_connected_events, external_connected_events):
    """
    Re-format event information dependent on the event type.

    :param issue_data: the data of all issues that shall be re-formatted
    :return: the issue data with updated event information
    """


lock reason is saved in event_info_1

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
docstrings should now more accurately reflect parameters and return
values

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>
For consistency with github events

Signed-off-by: Leo Sendelbach <s8lesend@stud.uni-saarland.de>

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

bot_processing/bot_processing.py:137

  • predicted_bots is documented/used as “usernames of the bots predicted to be a bot”, but the current code collects usernames from all rows in bot_data (which includes humans too per load_bot_data). This can prevent known bots/agents from being added because they will appear as “predicted” even if predicted as Human. Filter predicted_bots by the prediction column (e.g., only rows with label Bot/Agent) before doing membership checks.
    # Get the GitHub usernames of the bots predicted to be a bot
    predicted_bots = [bot[0] if len(bot) > 0 else "" for bot in bot_data]
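The filtering suggested in this review comment could be sketched as follows. The position of the prediction column and the label values `"Bot"` and `"Agent"` are assumptions (the comment only mentions "label Bot/Agent"), so the column index is passed in explicitly rather than hard-coded.

```python
def get_predicted_bots(bot_data, prediction_index, bot_labels=("Bot", "Agent")):
    """Return the usernames of rows whose prediction is a bot or agent label.

    `bot_data` is a list of rows with the username in column 0;
    `prediction_index` is the (assumed) column holding the prediction.
    """
    return [
        row[0]
        for row in bot_data
        # Skip empty or short rows instead of collecting empty usernames.
        if len(row) > prediction_index and row[prediction_index] in bot_labels
    ]
```

Unlike the quoted snippet, rows predicted as human no longer count as "predicted bots", so known bots and agents that were misclassified can still be added afterwards.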



previously removed event_info_2 for state_updated event, leading to
crashes of the issue processing. Now, it instead contains an empty
string.
Also fix a minor spelling mistake

Signed-off-by: <s8lesend@stud.uni-saarland.de>
Collaborator

@bockthom bockthom left a comment


I didn't spot any errors or inconsistencies in my test run, so let's assume everything works as intended.
Thanks for your efforts @Leo-Send.


3 participants