[AMDGPU][SDAG] Initial support for ISD::PTRADD #141725
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-amdgpu
Author: Fabian Ritter (ritter-x2a)
Changes
Enable generation of PTRADD SelectionDAG nodes for pointer arithmetic for SI, Since GlobalISel also uses the PTRADD SDAG patterns via SelectionDAGCompat,
Overall, the changes in the GlobalISel tests seem to be improvements.
Patch is 1.01 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/141725.diff
50 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index ade88a16193b8..15e060a6e8e59 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -61,6 +61,13 @@ static cl::opt<bool> UseDivergentRegisterIndexing(
cl::desc("Use indirect register addressing for divergent indexes"),
cl::init(false));
+// TODO This option should be removed once we switch to always using PTRADD in
+// the SelectionDAG.
+static cl::opt<bool> UseSelectionDAGPTRADD(
+    "amdgpu-use-sdag-ptradd", cl::Hidden,
+    cl::desc("Generate ISD::PTRADD nodes in the SelectionDAG ISel"),
+    cl::init(false));
+
static bool denormalModeIsFlushAllF32(const MachineFunction &MF) {
const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
return Info->getMode().FP32Denormals == DenormalMode::getPreserveSign();
@@ -10419,6 +10426,11 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
}
}
+bool SITargetLowering::shouldPreservePtrArith(const Function &F,
+                                              EVT PtrVT) const {
+  return UseSelectionDAGPTRADD;
+}
+
// The raw.(t)buffer and struct.(t)buffer intrinsics have two offset args:
// offset (the offset that is included in bounds checking and swizzling, to be
// split between the instruction's voffset and immoffset fields) and soffset
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index c42366a1c04c8..bd9ec7cb8ec48 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -258,6 +258,8 @@ class SITargetLowering final : public AMDGPUTargetLowering {
bool shouldExpandVectorDynExt(SDNode *N) const;
+  bool shouldPreservePtrArith(const Function &F, EVT PtrVT) const override;
+
private:
// Analyze a combined offset from an amdgcn_s_buffer_load intrinsic and store
// the three offsets (voffset, soffset and instoffset) into the SDValue[3]
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 2e2913d88cc54..3ded1393e2ce3 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -1376,6 +1376,37 @@ def : GCNPat <
(i32 (V_MOV_B32_e32 (i32 0))), sub1)
>;
+//===----------------------------------------------------------------------===//
+// PTRADD Patterns
+//===----------------------------------------------------------------------===//
+
+def : GCNPat<
+  (DivergentBinFrag<ptradd> i64:$src0, i64:$src1),
+  (V_ADD_U64_PSEUDO $src0, $src1)>;
+
+def : GCNPat<
+  (DivergentBinFrag<ptradd> i32:$src0, i32:$src1),
+  (V_ADD_U32_e64 $src0, $src1, 0)> {
+  let SubtargetPredicate = HasAddNoCarryInsts;
+}
+
+def : GCNPat<
+  (DivergentBinFrag<ptradd> i32:$src0, i32:$src1),
+  (V_ADD_CO_U32_e64 $src0, $src1)> {
+  let SubtargetPredicate = NotHasAddNoCarryInsts;
+}
+
+def : GCNPat<
+  (UniformBinFrag<ptradd> i64:$src0, i64:$src1),
+  (S_ADD_U64_PSEUDO $src0, $src1)>;
+
+// Whether we select S_ADD_I32 or S_ADD_U32 does not make much of a
+// difference. Most notably, S_ADD_I32 instructions can be transformed
+// to S_ADDK_I32, so we select that.
+def : GCNPat<
+  (UniformBinFrag<ptradd> i32:$src0, i32:$src1),
+  (S_ADD_I32 $src0, $src1)>;
+
/********** ============================================ **********/
/********** Extraction, Insertion, Building and Casting **********/
/********** ============================================ **********/
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
index 7adaddf2fc8ba..5a3b36fc1ada2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
@@ -36,15 +36,15 @@ define amdgpu_kernel void @kernel_caller_stack() {
; FLATSCR-NEXT: s_mov_b32 s32, 0
; FLATSCR-NEXT: s_add_u32 flat_scratch_lo, s8, s13
; FLATSCR-NEXT: s_addc_u32 flat_scratch_hi, s9, 0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 4
+; FLATSCR-NEXT: s_add_i32 s0, s32, 4
; FLATSCR-NEXT: v_mov_b32_e32 v0, 9
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 8
+; FLATSCR-NEXT: s_add_i32 s0, s32, 8
; FLATSCR-NEXT: v_mov_b32_e32 v0, 10
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 12
+; FLATSCR-NEXT: s_add_i32 s0, s32, 12
; FLATSCR-NEXT: v_mov_b32_e32 v0, 11
-; FLATSCR-NEXT: s_add_u32 s2, s32, 16
+; FLATSCR-NEXT: s_add_i32 s2, s32, 16
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
; FLATSCR-NEXT: v_mov_b32_e32 v0, 12
; FLATSCR-NEXT: s_getpc_b64 s[0:1]
@@ -189,13 +189,13 @@ define amdgpu_kernel void @kernel_caller_byval() {
; FLATSCR-NEXT: s_getpc_b64 s[0:1]
; FLATSCR-NEXT: s_add_u32 s0, s0, external_void_func_byval@rel32@lo+4
; FLATSCR-NEXT: s_addc_u32 s1, s1, external_void_func_byval@rel32@hi+12
-; FLATSCR-NEXT: s_add_u32 s2, s32, 8
-; FLATSCR-NEXT: s_add_u32 s3, s32, 16
-; FLATSCR-NEXT: s_add_u32 s4, s32, 24
-; FLATSCR-NEXT: s_add_u32 s5, s32, 32
-; FLATSCR-NEXT: s_add_u32 s6, s32, 40
-; FLATSCR-NEXT: s_add_u32 s7, s32, 48
-; FLATSCR-NEXT: s_add_u32 s8, s32, 56
+; FLATSCR-NEXT: s_add_i32 s2, s32, 8
+; FLATSCR-NEXT: s_add_i32 s3, s32, 16
+; FLATSCR-NEXT: s_add_i32 s4, s32, 24
+; FLATSCR-NEXT: s_add_i32 s5, s32, 32
+; FLATSCR-NEXT: s_add_i32 s6, s32, 40
+; FLATSCR-NEXT: s_add_i32 s7, s32, 48
+; FLATSCR-NEXT: s_add_i32 s8, s32, 56
; FLATSCR-NEXT: s_waitcnt vmcnt(7)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[0:1], s32
; FLATSCR-NEXT: s_waitcnt vmcnt(7)
@@ -266,16 +266,16 @@ define void @func_caller_stack() {
; FLATSCR-NEXT: s_mov_b64 exec, s[2:3]
; FLATSCR-NEXT: s_add_i32 s32, s32, 16
; FLATSCR-NEXT: v_writelane_b32 v40, s0, 2
-; FLATSCR-NEXT: s_add_u32 s0, s32, 4
+; FLATSCR-NEXT: s_add_i32 s0, s32, 4
; FLATSCR-NEXT: v_mov_b32_e32 v0, 9
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 8
+; FLATSCR-NEXT: s_add_i32 s0, s32, 8
; FLATSCR-NEXT: v_mov_b32_e32 v0, 10
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 12
+; FLATSCR-NEXT: s_add_i32 s0, s32, 12
; FLATSCR-NEXT: v_mov_b32_e32 v0, 11
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
-; FLATSCR-NEXT: s_add_u32 s0, s32, 16
+; FLATSCR-NEXT: s_add_i32 s0, s32, 16
; FLATSCR-NEXT: v_mov_b32_e32 v0, 12
; FLATSCR-NEXT: v_writelane_b32 v40, s30, 0
; FLATSCR-NEXT: scratch_store_dword off, v0, s0
@@ -393,8 +393,8 @@ define void @func_caller_byval(ptr addrspace(5) %argptr) {
; FLATSCR-NEXT: s_add_i32 s32, s32, 16
; FLATSCR-NEXT: v_add_u32_e32 v3, 8, v0
; FLATSCR-NEXT: v_writelane_b32 v40, s0, 2
-; FLATSCR-NEXT: s_add_u32 s0, s32, 8
-; FLATSCR-NEXT: s_add_u32 s2, s32, 56
+; FLATSCR-NEXT: s_add_i32 s0, s32, 8
+; FLATSCR-NEXT: s_add_i32 s2, s32, 56
; FLATSCR-NEXT: v_writelane_b32 v40, s30, 0
; FLATSCR-NEXT: v_writelane_b32 v40, s31, 1
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
@@ -404,28 +404,28 @@ define void @func_caller_byval(ptr addrspace(5) %argptr) {
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT: s_add_u32 s0, s32, 16
+; FLATSCR-NEXT: s_add_i32 s0, s32, 16
; FLATSCR-NEXT: v_add_u32_e32 v3, 24, v0
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT: s_add_u32 s0, s32, 24
+; FLATSCR-NEXT: s_add_i32 s0, s32, 24
; FLATSCR-NEXT: v_add_u32_e32 v3, 32, v0
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT: s_add_u32 s0, s32, 32
+; FLATSCR-NEXT: s_add_i32 s0, s32, 32
; FLATSCR-NEXT: v_add_u32_e32 v3, 40, v0
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT: s_add_u32 s0, s32, 40
+; FLATSCR-NEXT: s_add_i32 s0, s32, 40
; FLATSCR-NEXT: v_add_u32_e32 v3, 48, v0
; FLATSCR-NEXT: v_add_u32_e32 v0, 56, v0
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[1:2], v3, off
-; FLATSCR-NEXT: s_add_u32 s0, s32, 48
+; FLATSCR-NEXT: s_add_i32 s0, s32, 48
; FLATSCR-NEXT: s_waitcnt vmcnt(0)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[1:2], s0
; FLATSCR-NEXT: scratch_load_dwordx2 v[0:1], v0, off
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
index 6b767d9e754be..a1bb8b390847f 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
@@ -20,7 +20,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align4(i32 %n) {
; GFX9-NEXT: v_mov_b32_e32 v1, s4
; GFX9-NEXT: s_lshl_b32 s5, s5, 6
; GFX9-NEXT: s_mov_b32 s33, 0
-; GFX9-NEXT: s_add_u32 s32, s4, s5
+; GFX9-NEXT: s_add_i32 s32, s4, s5
; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
; GFX9-NEXT: s_endpgm
;
@@ -39,7 +39,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align4(i32 %n) {
; GFX10-NEXT: s_lshl2_add_u32 s5, s5, 15
; GFX10-NEXT: s_and_b32 s5, s5, -16
; GFX10-NEXT: s_lshl_b32 s5, s5, 5
-; GFX10-NEXT: s_add_u32 s32, s4, s5
+; GFX10-NEXT: s_add_i32 s32, s4, s5
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: kernel_dynamic_stackalloc_sgpr_align4:
@@ -56,7 +56,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align4(i32 %n) {
; GFX11-NEXT: s_and_b32 s1, s1, -16
; GFX11-NEXT: s_lshl_b32 s1, s1, 5
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT: s_add_u32 s32, s0, s1
+; GFX11-NEXT: s_add_i32 s32, s0, s1
; GFX11-NEXT: s_endpgm
%alloca = alloca i32, i32 %n, align 4, addrspace(5)
store i32 0, ptr addrspace(5) %alloca
@@ -84,7 +84,7 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX9-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX9-NEXT: s_and_b32 s4, s4, -16
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
-; GFX9-NEXT: s_add_u32 s32, s6, s4
+; GFX9-NEXT: s_add_i32 s32, s6, s4
; GFX9-NEXT: s_mov_b32 s32, s33
; GFX9-NEXT: s_mov_b32 s33, s7
; GFX9-NEXT: s_waitcnt vmcnt(0)
@@ -110,7 +110,7 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX10-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
-; GFX10-NEXT: s_add_u32 s32, s6, s4
+; GFX10-NEXT: s_add_i32 s32, s6, s4
; GFX10-NEXT: s_mov_b32 s32, s33
; GFX10-NEXT: s_mov_b32 s33, s7
; GFX10-NEXT: s_setpc_b64 s[30:31]
@@ -136,7 +136,7 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX11-NEXT: s_and_b32 s0, s0, -16
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT: s_add_u32 s32, s2, s0
+; GFX11-NEXT: s_add_i32 s32, s2, s0
; GFX11-NEXT: s_mov_b32 s32, s33
; GFX11-NEXT: s_mov_b32 s33, s3
; GFX11-NEXT: s_setpc_b64 s[30:31]
@@ -161,7 +161,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align16(i32 %n) {
; GFX9-NEXT: v_mov_b32_e32 v1, s4
; GFX9-NEXT: s_lshl_b32 s5, s5, 6
; GFX9-NEXT: s_mov_b32 s33, 0
-; GFX9-NEXT: s_add_u32 s32, s4, s5
+; GFX9-NEXT: s_add_i32 s32, s4, s5
; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
; GFX9-NEXT: s_endpgm
;
@@ -180,7 +180,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align16(i32 %n) {
; GFX10-NEXT: s_lshl2_add_u32 s5, s5, 15
; GFX10-NEXT: s_and_b32 s5, s5, -16
; GFX10-NEXT: s_lshl_b32 s5, s5, 5
-; GFX10-NEXT: s_add_u32 s32, s4, s5
+; GFX10-NEXT: s_add_i32 s32, s4, s5
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: kernel_dynamic_stackalloc_sgpr_align16:
@@ -197,7 +197,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align16(i32 %n) {
; GFX11-NEXT: s_and_b32 s1, s1, -16
; GFX11-NEXT: s_lshl_b32 s1, s1, 5
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT: s_add_u32 s32, s0, s1
+; GFX11-NEXT: s_add_i32 s32, s0, s1
; GFX11-NEXT: s_endpgm
%alloca = alloca i32, i32 %n, align 16, addrspace(5)
store i32 0, ptr addrspace(5) %alloca
@@ -225,7 +225,7 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX9-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX9-NEXT: s_and_b32 s4, s4, -16
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
-; GFX9-NEXT: s_add_u32 s32, s6, s4
+; GFX9-NEXT: s_add_i32 s32, s6, s4
; GFX9-NEXT: s_mov_b32 s32, s33
; GFX9-NEXT: s_mov_b32 s33, s7
; GFX9-NEXT: s_waitcnt vmcnt(0)
@@ -251,7 +251,7 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX10-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
-; GFX10-NEXT: s_add_u32 s32, s6, s4
+; GFX10-NEXT: s_add_i32 s32, s6, s4
; GFX10-NEXT: s_mov_b32 s32, s33
; GFX10-NEXT: s_mov_b32 s33, s7
; GFX10-NEXT: s_setpc_b64 s[30:31]
@@ -277,7 +277,7 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX11-NEXT: s_and_b32 s0, s0, -16
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT: s_add_u32 s32, s2, s0
+; GFX11-NEXT: s_add_i32 s32, s2, s0
; GFX11-NEXT: s_mov_b32 s32, s33
; GFX11-NEXT: s_mov_b32 s33, s3
; GFX11-NEXT: s_setpc_b64 s[30:31]
@@ -294,7 +294,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX9-NEXT: s_movk_i32 s32, 0x800
; GFX9-NEXT: s_add_u32 s0, s0, s17
; GFX9-NEXT: s_addc_u32 s1, s1, 0
-; GFX9-NEXT: s_add_u32 s5, s32, 0x7ff
+; GFX9-NEXT: s_add_i32 s5, s32, 0x7ff
; GFX9-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX9-NEXT: s_and_b32 s5, s5, 0xfffff800
@@ -303,7 +303,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX9-NEXT: v_mov_b32_e32 v1, s5
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
; GFX9-NEXT: s_mov_b32 s33, 0
-; GFX9-NEXT: s_add_u32 s32, s5, s4
+; GFX9-NEXT: s_add_i32 s32, s5, s4
; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
; GFX9-NEXT: s_endpgm
;
@@ -313,7 +313,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX10-NEXT: s_movk_i32 s32, 0x400
; GFX10-NEXT: s_add_u32 s0, s0, s17
; GFX10-NEXT: s_addc_u32 s1, s1, 0
-; GFX10-NEXT: s_add_u32 s5, s32, 0x3ff
+; GFX10-NEXT: s_add_i32 s5, s32, 0x3ff
; GFX10-NEXT: v_mov_b32_e32 v0, 0
; GFX10-NEXT: s_and_b32 s5, s5, 0xfffffc00
; GFX10-NEXT: s_mov_b32 s33, 0
@@ -323,7 +323,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX10-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
-; GFX10-NEXT: s_add_u32 s32, s5, s4
+; GFX10-NEXT: s_add_i32 s32, s5, s4
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: kernel_dynamic_stackalloc_sgpr_align32:
@@ -331,7 +331,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX11-NEXT: s_load_b32 s0, s[4:5], 0x0
; GFX11-NEXT: s_mov_b32 s32, 32
; GFX11-NEXT: v_mov_b32_e32 v0, 0
-; GFX11-NEXT: s_add_u32 s1, s32, 0x3ff
+; GFX11-NEXT: s_add_i32 s1, s32, 0x3ff
; GFX11-NEXT: s_mov_b32 s33, 0
; GFX11-NEXT: s_and_b32 s1, s1, 0xfffffc00
; GFX11-NEXT: scratch_store_b32 off, v0, s1
@@ -341,7 +341,7 @@ define amdgpu_kernel void @kernel_dynamic_stackalloc_sgpr_align32(i32 %n) {
; GFX11-NEXT: s_and_b32 s0, s0, -16
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT: s_add_u32 s32, s1, s0
+; GFX11-NEXT: s_add_i32 s32, s1, s0
; GFX11-NEXT: s_endpgm
%alloca = alloca i32, i32 %n, align 32, addrspace(5)
store i32 0, ptr addrspace(5) %alloca
@@ -366,7 +366,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX9-NEXT: s_mov_b32 s33, s6
; GFX9-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-NEXT: s_load_dword s4, s[4:5], 0x0
-; GFX9-NEXT: s_add_u32 s5, s32, 0x7ff
+; GFX9-NEXT: s_add_i32 s5, s32, 0x7ff
; GFX9-NEXT: s_and_b32 s5, s5, 0xfffff800
; GFX9-NEXT: v_mov_b32_e32 v1, s5
; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
@@ -374,7 +374,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX9-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX9-NEXT: s_and_b32 s4, s4, -16
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
-; GFX9-NEXT: s_add_u32 s32, s5, s4
+; GFX9-NEXT: s_add_i32 s32, s5, s4
; GFX9-NEXT: s_mov_b32 s32, s34
; GFX9-NEXT: s_mov_b32 s34, s7
; GFX9-NEXT: s_waitcnt vmcnt(0)
@@ -397,7 +397,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX10-NEXT: s_mov_b32 s33, s6
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_load_dword s4, s[4:5], 0x0
-; GFX10-NEXT: s_add_u32 s5, s32, 0x3ff
+; GFX10-NEXT: s_add_i32 s5, s32, 0x3ff
; GFX10-NEXT: s_and_b32 s5, s5, 0xfffffc00
; GFX10-NEXT: v_mov_b32_e32 v1, s5
; GFX10-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
@@ -405,7 +405,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX10-NEXT: s_lshl2_add_u32 s4, s4, 15
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
-; GFX10-NEXT: s_add_u32 s32, s5, s4
+; GFX10-NEXT: s_add_i32 s32, s5, s4
; GFX10-NEXT: s_mov_b32 s32, s34
; GFX10-NEXT: s_mov_b32 s34, s7
; GFX10-NEXT: s_setpc_b64 s[30:31]
@@ -427,7 +427,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX11-NEXT: s_mov_b32 s33, s2
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-NEXT: s_load_b32 s0, s[0:1], 0x0
-; GFX11-NEXT: s_add_u32 s1, s32, 0x3ff
+; GFX11-NEXT: s_add_i32 s1, s32, 0x3ff
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(SALU_CYCLE_1)
; GFX11-NEXT: s_and_b32 s1, s1, 0xfffffc00
; GFX11-NEXT: scratch_store_b32 off, v0, s1
@@ -436,7 +436,7 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX11-NEXT: s_and_b32 s0, s0, -16
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
-; GFX11-NEXT: s_add_u32 s32, s1, s0
+; GFX11-NEXT: s_add_i32 s32, s1, s0
; GFX11-NEXT: s_mov_b32 s32, s34
; GFX11-NEXT: s_mov_b32 s34, s3
; GFX11-NEXT: s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
index 8a80afd4a768f..d1083588e8ac0 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
@@ -855,7 +855,7 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
; GFX9-NEXT: s_lshl_b32 s0, s0, 7
; GFX9-NEXT: v_sub_u32_e32 v0, 0, v0
; GFX9-NEXT: v_add_u32_e32 v1, 0x100, v1
-; GFX9-NEXT: s_add_u32 s0, 0x100, s0
+; GFX9-NEXT: s_addk_i32 s0, 0x100
; GFX9-NEXT: v_mov_b32_e32 v2, 15
; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX9-NEXT: scratch_store_dword v1, v2, off offset:128
@@ -883,7 +883,7 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_lshl_b32 s0, s0, 7
-; GFX10-NEXT: s_add_u32 s0, 0x100, s0
+; GFX10-NEXT: s_addk_i32 s0, 0x100
; GFX10-NEXT: v_add_nc_u32_e32 v1, s0, v1
; GFX10-NEXT: scratch_load_dword v0, v1, off offset:124 glc dlc
; GFX10-NEXT: s_waitcnt vmcnt(0)
@@ -899,7 +899,7 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
; GFX942-NEXT: v_sub_u32_e32 v0, 0, v0
; GFX942-NEXT: s_waitcnt lgkmcnt(0)
; GFX942-NEXT: s_lshl_b32 s0, s0, 7
-; GFX942-NEXT: s_add_u32 s0, 0x100, s0
+; GFX942-NEXT: s_addk_i32 s0, 0x100
; GFX942-NEX...
[truncated]
GlobalISel should not select the pseudos. We should directly implement the legality rules for 64-bit pointer arithmetic.
I think the only plus is it removes one useless entry from the allocation order, since exec is always reserved and will never be used for the assignment.
Ping.
63f76a4 to d62ce07
Rebase.
@arsenm, do you see anything that should still be improved about this patch?
d62ce07 to 58ac8e0
Rebase, to get a fix for a failing Linux CI test.
Fixed now, sorry for all the noise.
Enable generation of PTRADD SelectionDAG nodes for pointer arithmetic for SI, for now behind an internal CLI option. Also add basic patterns to match these nodes. Optimizations will come in follow-up PRs. Basic tests for SDAG codegen with PTRADD are in test/CodeGen/AMDGPU/ptradd-sdag.ll.
Only affects 64-bit address spaces for now, since the immediate use case only affects the flat address space.
For SWDEV-516125.
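
As a rough illustration of the selection rules that the basic patterns in SIInstructions.td encode, the following C++ sketch models which instruction a PTRADD node is matched to, depending on divergence, pointer width, and whether the subtarget has carry-less adds. This is not LLVM code: `PtrAddQuery` and `selectPtrAdd` are hypothetical names made up for this sketch, and only the instruction names are taken from the patterns in the diff.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the properties the TableGen patterns test.
struct PtrAddQuery {
  bool Divergent;     // models DivergentBinFrag vs. UniformBinFrag
  unsigned Bits;      // pointer width: 32 or 64
  bool HasAddNoCarry; // models the HasAddNoCarryInsts subtarget predicate
};

// Returns the name of the instruction the patterns in the diff would pick.
inline std::string selectPtrAdd(const PtrAddQuery &Q) {
  assert((Q.Bits == 32 || Q.Bits == 64) && "only i32 and i64 are covered");
  if (Q.Divergent) {
    if (Q.Bits == 64)
      return "V_ADD_U64_PSEUDO";
    // 32-bit divergent adds depend on the subtarget predicate.
    return Q.HasAddNoCarry ? "V_ADD_U32_e64" : "V_ADD_CO_U32_e64";
  }
  // Uniform case: S_ADD_I32 is chosen over S_ADD_U32 because it can later
  // be transformed to S_ADDK_I32 (see the comment in the .td hunk).
  return Q.Bits == 64 ? "S_ADD_U64_PSEUDO" : "S_ADD_I32";
}
```

For example, `selectPtrAdd(PtrAddQuery{false, 32u, true})` yields `"S_ADD_I32"`, which matches the `s_add_u32` to `s_add_i32` changes in the GlobalISel test checks.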
These functions are for building G_PTR_ADDs when we know that the base pointer and the result are both valid pointers into (or just after) the same object. They are similar to SelectionDAG::getObjectPtrOffset. This PR also changes call sites of the generic (build|materialize)PtrAdd functions that implement pointer arithmetic to split large memory accesses to the new functions. Since memory accesses have to fit into an object in memory, pointer arithmetic to an offset into a large memory access also yields an address in that object. Currently, these (build|materialize)ObjectPtrOffset functions only add "nuw" to the generated G_PTR_ADD, but I intend to introduce an "inbounds" MIFlag in a later PR (analogous to a concurrent effort in SDAG: #131862, related: #140017, #141725) that will also be set in the (build|materialize)ObjectPtrOffset functions. Most test changes just add "nuw" to G_PTR_ADDs. Exceptions are AMDGPU's call-outgoing-stack-args.ll, flat-scratch.ll, and freeze.ll tests, where offsets are now folded into scratch instructions, and cases where the behavior of the check regeneration script changed, resulting, e.g., in better checks for "nusw G_PTR_ADD" instructions, matched empty lines, and the use of "CHECK-NEXT" in MIPS tests. For SWDEV-516125.
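
The reasoning above (an offset into a large memory access stays inside the accessed object, so the add cannot wrap) can be sketched in plain C++. This is a toy model under stated assumptions, not the GlobalISel API: `NuwPtrAdd`, `objectPtrOffset`, and `splitAccess` are hypothetical names, and addresses are modeled as 64-bit integers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a pointer add that carries a "nuw" flag.
struct NuwPtrAdd {
  uint64_t Base;
  uint64_t Offset;
  bool NoUnsignedWrap;
  uint64_t result() const { return Base + Offset; }
};

// Models the idea behind (build|materialize)ObjectPtrOffset: the caller
// promises that Base and Base + Offset point into (or just after) the same
// object, so the addition cannot wrap and may be marked "nuw".
inline NuwPtrAdd objectPtrOffset(uint64_t Base, uint64_t Offset) {
  assert(Offset <= UINT64_MAX - Base && "offset would wrap the address space");
  return NuwPtrAdd{Base, Offset, /*NoUnsignedWrap=*/true};
}

// Splits one access of TotalBytes at Base into PartBytes-sized pieces.
// Every piece lies inside the original access, hence inside the object.
inline std::vector<NuwPtrAdd> splitAccess(uint64_t Base, uint64_t TotalBytes,
                                          uint64_t PartBytes) {
  assert(PartBytes != 0 && TotalBytes % PartBytes == 0);
  std::vector<NuwPtrAdd> Parts;
  for (uint64_t Off = 0; Off < TotalBytes; Off += PartBytes)
    Parts.push_back(objectPtrOffset(Base, Off));
  return Parts;
}
```

Splitting a 16-byte access at `0x1000` into 4-byte parts produces four flagged adds, the last one addressing `0x100c`.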
This flag applies to G_PTR_ADD instructions and indicates that the operation implements an inbounds getelementptr operation, i.e., the pointer operand is in bounds wrt. the allocated object it is based on, and the arithmetic does not change that. It is set when the IRTranslator lowers inbounds GEPs (currently only in some cases, to be extended with a future PR), and in the (build|materialize)ObjectPtrOffset functions. Inbounds information is useful in ISel when we have instructions that perform address computations whose intermediate steps must be in the same memory region as the final result. A follow-up patch will start using it for AMDGPU's flat memory instructions, where the immediate offset must not affect the memory aperture of the address. This is analogous to a concurrent effort in SDAG: #131862 (related: #140017, #141725). For SWDEV-516125.
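
To make the aperture concern concrete, here is a small C++ sketch of why inbounds information helps when folding an offset into an instruction's immediate field. It is a simplified model, not AMDGPU's real aperture layout or LLVM's implementation: `FlaggedPtrAdd`, `apertureOf`, `canFoldOffset`, and `staysInAperture` are hypothetical, and the assumption that the upper 32 address bits select the memory region is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption for this sketch only: the upper 32 bits pick the memory region.
inline uint32_t apertureOf(uint64_t Addr) {
  return static_cast<uint32_t>(Addr >> 32);
}

// Hypothetical model of a pointer add with an inbounds flag.
struct FlaggedPtrAdd {
  uint64_t Base;
  int64_t Offset;
  bool InBounds;
};

// Compile-time decision: without runtime values, the offset is only folded
// when the flag guarantees that base and result lie in the same object.
inline bool canFoldOffset(const FlaggedPtrAdd &A) { return A.InBounds; }

// Runtime view, which a compiler generally cannot evaluate: does the
// intermediate base address sit in the same region as the final address?
inline bool staysInAperture(const FlaggedPtrAdd &A) {
  uint64_t Result = A.Base + static_cast<uint64_t>(A.Offset);
  return apertureOf(A.Base) == apertureOf(Result);
}
```

In this model, an add without the flag such as `{0x1FFFFFFF0, 0x20, false}` crosses from one region into the next, which is the case the flag is meant to rule out before folding.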