-
Notifications
You must be signed in to change notification settings - Fork 4
Implemented b-state for unicodeobject #37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: migration
Are you sure you want to change the base?
Conversation
for unicode hide bstate inside of unused bit of int for byte create a subclass object pg_bytes with bstate field
Include/cpython/unicodeobject.h
Outdated
| /* Padding to ensure that PyUnicode_DATA() is always aligned to | ||
| 4 bytes (see issue #19537 on m68k). */ | ||
| unsigned int :25; | ||
| unsigned int bstate:4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have a different integer for Unicode, byte, undetermined, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out. I think since I allocate 4 bits for this field, i can use different number to represent different state like 0->unsure, 1->byte, 2->Unicode. But I think your suggestion is better that it can be more clear and easier to access and modify. Will fix!
add bstate for unicodeobject add some useful macro add logic and warning for unicodeobject.concat
| return NULL; | ||
| } | ||
| } | ||
| else{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting has gone wonky here.
| #define PyBytes_GET_BSTATE(op) (((PyBytesObject *)(op))->bstate) | ||
| #define PyBytes_SET_BSTATE(op, val) (((PyBytesObject *)(op))->bstate = (unsigned int)(val)) | ||
|
|
||
| #define PG_BSTATE_LOAD_BYTES(op_) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really sure what the intention of this macro is, and the one call site doesn't immediately make it obvious to me. Can we provide a quick doc string?
| 4 bytes (see issue #19537 on m68k). */ | ||
| unsigned int :25; | ||
| unsigned int bstate:4; | ||
| unsigned int :21; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line has now been divorced from its doc string, so the "always aligned to 4 bytes" no longer makes sense.
That means that bstate needs lifting to a few lines earlier and having a short doc string. In particular, it's not obvious to me why this needs 4 bits. Doesn't it only need 2?
| unsure + byte = byte | ||
| unicode + unsure = unicode | ||
| unicode + byte = unicode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so this is I think the big thing we need to consider (alongside byte + unicode = byte). In a sense, whenever this scenario happens it's "bad": the user is mixing things up in a way that won't work in Python 3.
We then have three choices: (1) we arbitrarily pick a winner (which is what the above logic does) (2) we revert back to "unsure" or (3) we have an additional "mixed" kind. I can see advantages to all three possibilities but it's hard to know which might be best. My probable criteria would be "which causes the fewest spurious warnings after the first warning?" I suspect we actually need to try this on some real code to know.
|
There's one major thing to consider above (#37 (comment)) but in addition:
|
|
Lets pause due to the |
|
FWIW I think this PR is heading in the right direction! |
|
Well we talked of not needing |
Implemented b-state for unicodeobject by utilizing 4 bit of a int in unicodeobject->PyCompactUnicodeObject->PyASCIIObject->state
Implemented 2 macro for accessing and setting b-state