Implemented b-state for unicodeobject #37

Eddy114514 · 2025-11-06T01:02:50Z

Implemented b-state for unicodeobject by using 4 bits of an int in unicodeobject->PyCompactUnicodeObject->PyASCIIObject->state

Implemented 2 macros for accessing and setting the b-state

merge

for unicode hide bstate inside of unused bit of int for byte create a subclass object pg_bytes with bstate field

nanjekyejoannah · 2025-11-11T13:56:49Z

Include/cpython/unicodeobject.h

        /* Padding to ensure that PyUnicode_DATA() is always aligned to
           4 bytes (see issue #19537 on m68k). */
-        unsigned int :25;
+        unsigned int bstate:4


We should have a different integer for Unicode, byte, undetermined, no?

Thanks for pointing this out. Since I allocate 4 bits for this field, I could use different numbers to represent the different states, e.g. 0 -> unsure, 1 -> byte, 2 -> Unicode. But I think your suggestion is better: it is clearer and easier to access and modify. Will fix!
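For illustration, the numbering described in this reply could be given names along the following lines. This is only a sketch: the enum, the stand-in struct, and the `PG_DEMO_*` macros are hypothetical names that do not appear in the PR, and the real field lives inside `PyASCIIObject.state`.

```c
#include <assert.h>

/* Hypothetical names; none of these identifiers come from the PR. */
typedef enum {
    PG_BSTATE_UNSURE  = 0,  /* origin not yet determined */
    PG_BSTATE_BYTE    = 1,  /* value is treated as byte-like */
    PG_BSTATE_UNICODE = 2   /* value is treated as unicode-like */
} pg_bstate;

/* Stand-in for the object header, assumed for this sketch only. */
typedef struct {
    unsigned int bstate : 4;
} pg_demo_object;

/* Every named state must fit in the 4-bit field. */
static_assert(PG_BSTATE_UNICODE < (1 << 4), "bstate values must fit in 4 bits");

/* Accessors in the style of the PR's GET/SET macros. */
#define PG_DEMO_GET_BSTATE(op)      ((pg_bstate)((op)->bstate))
#define PG_DEMO_SET_BSTATE(op, val) ((op)->bstate = (unsigned int)(val))
```

Named constants keep call sites readable (`PG_DEMO_SET_BSTATE(obj, PG_BSTATE_BYTE)` rather than a bare `1`) without changing how the value is stored.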

add bstate for unicodeobject add some useful macro add logic and warning for unicodeobject.concat

ltratt · 2025-11-18T08:33:19Z

Objects/unicodeobject.c

+            return NULL;
+        }
+    }
+    else{


Formatting has gone wonky here.

ltratt · 2025-11-18T08:34:01Z

Include/cpython/bytesobject.h

+#define PyBytes_GET_BSTATE(op)       (((PyBytesObject *)(op))->bstate)
+#define PyBytes_SET_BSTATE(op, val)  (((PyBytesObject *)(op))->bstate = (unsigned int)(val))
+
+#define PG_BSTATE_LOAD_BYTES(op_) \


I'm not really sure what the intention of this macro is, and the one call site doesn't immediately make it obvious to me. Can we provide a quick doc string?

ltratt · 2025-11-18T08:35:54Z

Include/cpython/unicodeobject.h

           4 bytes (see issue #19537 on m68k). */
-        unsigned int :25;
+        unsigned int bstate:4;
+        unsigned int :21;


This line has now been divorced from its doc string, so the "always aligned to 4 bytes" no longer makes sense.

That means that bstate needs lifting to a few lines earlier and having a short doc string. In particular, it's not obvious to me why this needs 4 bits. Doesn't it only need 2?
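A minimal sketch of the layout being asked for, assuming a 32-bit `unsigned int` and assuming that the fields ahead of the padding occupy 7 bits (inferred from the original `:25` padding). The struct name and the `existing_flags` field are placeholders, not the real `PyASCIIObject` members.

```c
#include <assert.h>

/* Hypothetical mirror of the flag word; only bstate is taken from the PR. */
struct pg_demo_state {
    /* Assumed: the 7 bits already in use before this change. */
    unsigned int existing_flags : 7;
    /* Conversion state: unsure, byte or unicode.
       Three values fit in 2 bits (2 bits encode up to 4 values). */
    unsigned int bstate : 2;
    /* Padding to fill the word: 7 + 2 + 23 = 32 bits. */
    unsigned int : 23;
};

/* The flag word should still occupy exactly one unsigned int. */
static_assert(sizeof(struct pg_demo_state) == sizeof(unsigned int),
              "flag word must stay one unsigned int wide");
```

Keeping the padding as the last field, directly under its own comment, leaves the alignment note attached to the line it describes.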

ltratt · 2025-11-18T08:40:02Z

Include/cpython/unicodeobject.h

+unsure + byte = byte
+
+unicode + unsure = unicode
+unicode + byte = unicode


Right, so this is I think the big thing we need to consider (alongside byte + unicode = byte). In a sense, whenever this scenario happens it's "bad": the user is mixing things up in a way that won't work in Python 3.

We then have three choices: (1) we arbitrarily pick a winner (which is what the above logic does) (2) we revert back to "unsure" or (3) we have an additional "mixed" kind. I can see advantages to all three possibilities but it's hard to know which might be best. My probable criteria would be "which causes the fewest spurious warnings after the first warning?" I suspect we actually need to try this on some real code to know.
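To make the three options concrete, here is a rough sketch of a combine function parameterised by policy. All names are hypothetical, and the "left operand wins" reading of option (1) is inferred from the rules quoted in the diff together with "byte + unicode = byte"; treating "mixed" as sticky is an extra assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical state and policy names for this sketch only. */
typedef enum { PG_UNSURE = 0, PG_BYTE = 1, PG_UNICODE = 2, PG_MIXED = 3 } pg_bstate;
typedef enum {
    PG_PICK_LEFT,     /* (1) arbitrarily pick a winner: the left operand */
    PG_RESET_UNSURE,  /* (2) revert to "unsure" */
    PG_MARK_MIXED     /* (3) record an additional "mixed" kind */
} pg_policy;

/* Compute the bstate of `left + right` under the given conflict policy. */
static pg_bstate
pg_combine(pg_bstate left, pg_bstate right, pg_policy policy)
{
    /* Assumption: once a value is mixed, it stays mixed. */
    if (left == PG_MIXED || right == PG_MIXED) {
        return PG_MIXED;
    }
    /* An unsure operand adopts the state of the other operand. */
    if (left == PG_UNSURE) {
        return right;
    }
    if (right == PG_UNSURE) {
        return left;
    }
    /* Both determined and equal: nothing to resolve. */
    if (left == right) {
        return left;
    }
    /* Conflict: byte combined with unicode. */
    switch (policy) {
    case PG_PICK_LEFT:
        return left;
    case PG_RESET_UNSURE:
        return PG_UNSURE;
    case PG_MARK_MIXED:
        return PG_MIXED;
    }
    assert(0 && "unknown policy");
    return PG_UNSURE;
}
```

With the policy isolated like this, the same real-world code could be run under each option and the number of warnings after the first compared directly.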

ltratt · 2025-11-18T08:43:37Z

There's one major thing to consider above (#37 (comment)) but in addition:

- I don't think we want to repeat the same macros (with minor variations) twice. Can we put these into one place?
- PRs like this need at least one test.

nanjekyejoannah · 2025-11-18T19:45:07Z

Let's pause because of the bstate discussion.

ltratt · 2025-11-20T19:58:31Z

FWIW I think this PR is heading in the right direction!

nanjekyejoannah · 2025-12-03T19:15:29Z

Well, we talked about not needing bstate in pygrate3, but yeah, let's touch base for consensus in the meeting.

Eddy114514 and others added 3 commits October 29, 2025 17:20

Merge pull request #1 from softdevteam/migration

f936e80

merge

add bstate for string obj of str(unicode) and byte

ed79de6

for unicode hide bstate inside of unused bit of int for byte create a subclass object pg_bytes with bstate field

just do unicode

4677437

Eddy114514 requested review from ltratt and nanjekyejoannah November 6, 2025 01:02

Eddy114514 assigned ltratt and nanjekyejoannah Nov 6, 2025

nanjekyejoannah reviewed Nov 11, 2025


add bstate for bytesobject

105cce6

add bstate for unicodeobject add some useful macro add logic and warning for unicodeobject.concat

ltratt reviewed Nov 18, 2025

