Skip to content

Conversation

@ThomasWaldmann
Copy link
Member

No description provided.

@ThomasWaldmann
Copy link
Member Author

ThomasWaldmann commented Sep 6, 2025

Fix and test mostly written by Junie AI, some cleanups by me.

@codecov
Copy link

codecov bot commented Sep 6, 2025

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 80.59%. Comparing base (56dda84) to head (ce27418).
⚠️ Report is 4 commits behind head on 1.4-maint.

Files with missing lines Patch % Lines
src/borg/archive.py 85.71% 1 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           1.4-maint    #9003      +/-   ##
=============================================
- Coverage      80.60%   80.59%   -0.01%     
=============================================
  Files             38       38              
  Lines          11250    11247       -3     
  Branches        1767     1767              
=============================================
- Hits            9068     9065       -3     
  Misses          1611     1611              
  Partials         571      571              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MichaelDietzel
Copy link

MichaelDietzel commented Oct 2, 2025

I finally did some simple tests with these changes and the results are more consistent than what I got before. I will run some bigger tests over night.
This looks a lot simpler than what I attempted which I like. But could there maybe be some missed cases?
A few things that caused difficulties for me:

  • In Archive.py in the save()-function there is line 607 self.items_buffer.flush(flush=True). If I understand it correctly this also causes some stats to be updated. But it looks to me like these stats are never used for anything?
  • In Archive.py in the set_meta()-function there is line 962 self.cache.add_chunk(new_id, data, self.stats) which also appears to me as it counts metadata into some statistics. However this is probably not used during archive creation
  • How does the size of symlinks influence the stats? The size of symlinks could be platform dependent, so I am completely unsure how this is handled. Maybe this is out of scope for this change?

What I also do not understand: what are these new metadata stats used for, are they stored or reported at any time? If not: maybe they do not even need to be generated at all.

@MichaelDietzel
Copy link

MichaelDietzel commented Oct 5, 2025

I did a few more tests for archive creation under linux with this PR added:

Original size exactly matches the size of the original files (If only regular files are used, symlinks are not counted. I have not tested other special cases.)

The compressed size always looks plausible

The deduplicated size still has some unexpected results:
Here is an example where I created an empty archive:

create
                       Original size      Compressed size    Deduplicated size
This archive:                    0 B                  0 B                  0 B
All archives:                    0 B                  0 B                700 B

info
                       Original size      Compressed size    Deduplicated size
This archive:                    0 B                  0 B                700 B
All archives:                    0 B                  0 B                700 B

Do I understand the code correctly that the reported sizes in borg info are "cached" values that are stored to the repo instead of recalculated values, so that they should not change between create and info?

The empty repo without any archives reported 0 for all sizes.

And here an example for a bigger archive:

create
                       Original size      Compressed size    Deduplicated size
This archive:         967532246644 B       889015534170 B       727966606714 B
All archives:         967532246644 B       889015534170 B       728235281358 B

info
                       Original size      Compressed size    Deduplicated size
This archive:         967532246644 B       889015534170 B       728235281358 B
All archives:         967532246644 B       889015534170 B       728235281358 B

Sadly I have no Idea what could be causing this except for what I already stated I was wondering about in my previous post.
I found two places where these additional Bytes are added to the stats in the save-function:
self.items_buffer.flush(flush=True) and self.cache.add_chunk(self.id, data, self.meta_stats if hasattr(self, 'meta_stats') else self.stats). However I would still have to find out why they are not displayed for "This archive" during creation but then are displayed afterwards or why they are only visible for the deduplicated size.

I did not test any other operations than archive creation, that I would have to do next.

@ThomasWaldmann
Copy link
Member Author

ThomasWaldmann commented Oct 10, 2025

@MichaelDietzel I let Junie do a bit more work. First it added a bloody workaround, but on second try I guess it found the correct way. :-)

borg now in general does not account for metadata chunks in "this archive" stats, neither in "create" nor in "info".

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this pull request Oct 10, 2025
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this pull request Oct 10, 2025
…cated size" stats by excluding metadata chunks

fixes issue found in borgbackup#9003 comments.
@ThomasWaldmann
Copy link
Member Author

@MichaelDietzel guess this is as good as it gets?

there is still some discrepancy between "this archive" and "all archives" (which is rather "whole repo", computed in a rather different way).

@MichaelDietzel
Copy link

Thanks, I will take a look.

@ThomasWaldmann
Copy link
Member Author

@MichaelDietzel Thanks for helping!

BTW, I am preparing a borg 1.4.2 release, would be good if this PR could go into that.

@ThomasWaldmann ThomasWaldmann added this to the 1.4.2 milestone Oct 15, 2025
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this pull request Oct 25, 2025
add tests ensuring:
- borg info and create report same "This archive" deduplicated size
- before/after borg recreate it reports same "This archive" deduplicated size
- this/all archive(s) stats are same if 1 archive is in repo
- all archives stats is same for borg create and borg info

note that some stats differences are expected.
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this pull request Oct 25, 2025
borgbackup#9003

do not account archive metadata, only file contents.
@ThomasWaldmann
Copy link
Member Author

Tests rewritten to use borg's json output rather than regex-parsing the console output.

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this pull request Oct 25, 2025
add tests ensuring:
- borg info and create report same "This archive" deduplicated size
- before/after borg recreate it reports same "This archive" deduplicated size
- this/all archive(s) stats are same if 1 archive is in repo
- all archives stats is same for borg create and borg info

note that some stats differences are expected.
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this pull request Oct 25, 2025
borgbackup#9003

do not account archive metadata, only file contents.
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this pull request Oct 25, 2025
borgbackup#9003

do not account archive metadata, only file contents.
@MichaelDietzel
Copy link

Thank you for all your hard work and sorry for taking that long to find time to review this.

Compared to the version I tested previously the statistics printed by create and by info now match each other!
Also the source code changes look reasonable to me with my limited understanding of borg.

I just think I do not fully understand the resoning behind the test_single_archive_all_equals_this_info_and_create-test. Maybe this test was initially intended to test for something else?

  • When reading the name and the description I would think that it does something else than what it actually does. I thought that the test would test "this archive deduplicated size" = "all archives deduplicated size". Especially as it states "With a single archive in the repository". I think the number of archives should not matter for what this test does, right?
  • it seems like this test is doing pretty much does the same as the two other tests (test_info_matches_create_deduplicated_size and test_info_matches_create_all_archives_deduplicated_size), just now joined into one test. The only difference seems to be the last two lines that check for this != all. If this is really to be tested maybe these lines should check this < all instead?

@ThomasWaldmann
Copy link
Member Author

@MichaelDietzel I addressed all your feedback in the recent commit, thanks for finding these issues!

add tests ensuring:
- borg info and create report same "This archive" deduplicated size
- borg info and create report same "All archives" deduplicated size
- before/after borg recreate it reports same "This archive" deduplicated size

note that some stats differences are expected, because the repo-level
deduplication ("All archives") is computed in a different way than "This archive".
borgbackup#9003

do not account archive metadata, only file contents.
@ThomasWaldmann
Copy link
Member Author

Collapsed some commits, improved commit comments.

@ThomasWaldmann ThomasWaldmann merged commit 7a69499 into borgbackup:1.4-maint Oct 27, 2025
9 checks passed
@ThomasWaldmann ThomasWaldmann deleted the fix-8898 branch October 27, 2025 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants