Skip to content

Conversation

@Eddy114514
Copy link
Collaborator

Implemented b-state for unicodeobject by utilizing 4 bit of a int in unicodeobject->PyCompactUnicodeObject->PyASCIIObject->state

Implemented 2 macro for accessing and setting b-state

Eddy114514 and others added 3 commits October 29, 2025 17:20
for unicode hide bstate inside of unused bit of int
for byte create a subclass object pg_bytes with
bstate field
/* Padding to ensure that PyUnicode_DATA() is always aligned to
4 bytes (see issue #19537 on m68k). */
unsigned int :25;
unsigned int bstate:4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a different integer for Unicode, byte, undetermined, no?

Copy link
Collaborator Author

@Eddy114514 Eddy114514 Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out. I think since I allocate 4 bits for this field, i can use different number to represent different state like 0->unsure, 1->byte, 2->Unicode. But I think your suggestion is better that it can be more clear and easier to access and modify. Will fix!

add bstate for unicodeobject
add some useful macro
add logic and warning for unicodeobject.concat
return NULL;
}
}
else{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting has gone wonky here.

#define PyBytes_GET_BSTATE(op) (((PyBytesObject *)(op))->bstate)
#define PyBytes_SET_BSTATE(op, val) (((PyBytesObject *)(op))->bstate = (unsigned int)(val))

#define PG_BSTATE_LOAD_BYTES(op_) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure what the intention of this macro is, and the one call site doesn't immediately make it obvious to me. Can we provide a quick doc string?

4 bytes (see issue #19537 on m68k). */
unsigned int :25;
unsigned int bstate:4;
unsigned int :21;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line has now been divorced from its doc string, so the "always aligned to 4 bytes" no longer makes sense.

That means that bstate needs lifting to a few lines earlier and having a short doc string. In particular, it's not obvious to me why this needs 4 bits. Doesn't it only need 2?

unsure + byte = byte
unicode + unsure = unicode
unicode + byte = unicode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so this is I think the big thing we need to consider (alongside byte + unicode = byte). In a sense, whenever this scenario happens it's "bad": the user is mixing things up in a way that won't work in Python 3.

We then have three choices: (1) we arbitrarily pick a winner (which is what the above logic does) (2) we revert back to "unsure" or (3) we have an additional "mixed" kind. I can see advantages to all three possibilities but it's hard to know which might be best. My probable criteria would be "which causes the fewest spurious warnings after the first warning?" I suspect we actually need to try this on some real code to know.

@ltratt
Copy link
Member

ltratt commented Nov 18, 2025

There's one major thing to consider above (#37 (comment)) but in addition:

  • I don't think we want to repeat the same macros (with minor variations) twice. Can we put these into one place?
  • PRs like this need at least one test.

@nanjekyejoannah
Copy link
Collaborator

Lets pause due to the bstate discussion.

@ltratt
Copy link
Member

ltratt commented Nov 20, 2025

FWIW I think this PR is heading in the right direction!

@nanjekyejoannah
Copy link
Collaborator

Well we talked of not needing bstate in pygrate3 but yeah lets touch base for consensus in the meeting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants