Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lower GetElement on arm64 to the correct access sequence #104288

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

tannergooding
Copy link
Member

Unlike xarch where this could be handled a bit more trivially in codegen, Arm64 isn't as flexible and so we get better and more correct codegen by explicitly lowering to the correct sequence instead.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 2, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding tannergooding force-pushed the simd-getelement branch 3 times, most recently from a3b0300 to 95fdd44 Compare July 2, 2024 07:49
@tannergooding
Copy link
Member Author

/azp run runtime-nativeaot-outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@tannergooding tannergooding force-pushed the simd-getelement branch 3 times, most recently from 263de58 to 244c8da Compare July 3, 2024 05:56
@tannergooding
Copy link
Member Author

/azp run runtime-nativeaot-outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@tannergooding tannergooding marked this pull request as ready for review July 3, 2024 18:45
@tannergooding
Copy link
Member Author

CC. @dotnet/jit-contrib, This is the "proper" fix for #104232, which was worked around in #104264.

No TP diff, but good asmdiff, such as for Linux Arm64:

Overall (-10,228 bytes)
MinOpts (-4,844 bytes)
FullOpts (-5,384 bytes)

It looks like there's more improvements to be had, but those can be handled separately I think. An example is we're generating:

-            ldr     q16, [fp, #-0x68]	// [V98 tmp78]
-            dup     s16, v16.s[1]
-            ldr     q18, [fp, #-0x68]	// [V98 tmp78]
-            dup     s18, v18.s[2]
-            ldr     q19, [fp, #-0x68]	// [V98 tmp78]
-            dup     s19, v19.s[3]
+            sub     x1, fp, #104	// [V98 tmp78]
+            ; byrRegs +[x1]
+            ldr     s16, [x1, #0x04]
+            sub     x1, fp, #104	// [V98 tmp78]
+            ldr     s18, [x1, #0x08]
+            sub     x1, fp, #104	// [V98 tmp78]
+            ldr     s19, [x1, #0x0C]

This is a net improvement as we're only loading scalars, rather than repeatedly loading vectors, but it does represent a case that probably should've been represented as:

ldr     s16, [fp, #-0x64]
ldr     s18, [fp, #-0x60]
ldr     s19, [fp, #-0x5C]

I think there's just a general optimization missing here when the LclAddr itself has an offset and we're adding a constant offset (so its still representable as a single AddrMode)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant