Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure: System.Tests.GCTests.KeepAlive_Null #104218

Open
v-wenyuxu opened this issue Jul 1, 2024 · 19 comments
Open

Test failure: System.Tests.GCTests.KeepAlive_Null #104218

v-wenyuxu opened this issue Jul 1, 2024 · 19 comments
Assignees
Labels
area-GC-coreclr blocking-clean-ci-optional Blocking optional rolling runs
Milestone

Comments

@v-wenyuxu
Copy link

Failed in: runtime-coreclr libraries-jitstressregs 20240630.1

Failed tests:

net9.0-linux-Release-x64-jitstressregs0x80-Ubuntu.2204.Amd64.Open
    - System.Tests.GCTests.KeepAlive_Null

Error message:

 Assert.True() Failure
Expected: True
Actual:   False

Stack trace:

   at System.Tests.GCTests.KeepAliveNullTest.Run() in /_/src/libraries/System.Runtime/tests/System.Runtime.Tests/System/GCTests.cs:line 211
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57
@v-wenyuxu v-wenyuxu added os-linux Linux OS (any supported distro) JitStress CLR JIT issues involving JIT internal stress modes arch-x64 blocking-clean-ci-optional Blocking optional rolling runs labels Jul 1, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jul 1, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jul 1, 2024
@EgorBo
Copy link
Member

EgorBo commented Jul 1, 2024

Doesn't look to be jitstress specific, fails on:

using System.Runtime.CompilerServices;
using Xunit;

class KeepAliveNullTest
{
    private static void Main()
    {
        for (int i = 0; i < 200000; i++)
        {
            TestObject.Finalized = false;
            MakeAndNull();
            GC.Collect();
            GC.WaitForPendingFinalizers();
            Assert.True(TestObject.Finalized);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void MakeAndNull()
    {
        var deadObj = new TestObject();
        // it's dead here
    }

    public class TestObject
    {
        public static bool Finalized { get; set; }

        ~TestObject() => Finalized = true;
    }
}

with DOTNET_TieredCompilation=0. @dotnet/gc @VSadov any recent changes in GC could break it? It used to pass..

Copy link
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@vcsjones vcsjones removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jul 1, 2024
@VSadov
Copy link
Member

VSadov commented Jul 1, 2024

Unless [MethodImpl(MethodImplOptions.NoInlining)] is not respected or the allocation of the object is somehow optimized away, I do not see how JIT or root reporting could cause any changes in behavior here.

This is a very simple case. The object gets unreachable and GC.Collect() should see that.

Could it be that NoInlining is not working?

@EgorBo
Copy link
Member

EgorBo commented Jul 1, 2024

Unless [MethodImpl(MethodImplOptions.NoInlining)] is not respected or the allocation of the object is somehow optimized away, I do not see how JIT or root reporting could cause any changes in behavior here.

This is a very simple case. The object gets unreachable and GC.Collect() should see that.

Could it be that NoInlining is not working?

I've just checked - both NoInlining is correctly respected and the allocation is not optimized away (we normally do remove such allocations unless their classes have finalizers). So sounds like some change in GC's behavior

@EgorBo
Copy link
Member

EgorBo commented Jul 1, 2024

Seems like the DATAS is the culprit, the test doesn't fail with DOTNET_GCDynamicAdaptationMode=0

@jakobbotsch
Copy link
Member

My initial bisection suggests f69972f...d2dbdd0, though the repro is intermittent so it's possible I just haven't seen it repro with f69972f yet. In any case if that's the range #103501 looks most interesting, cc @jkotas

@jkotas
Copy link
Member

jkotas commented Jul 1, 2024

I do not think that the problem is with GC not collecting the object. I think the problem is in the WaitForPendingFinalizers() synchronization. If I change the loop to be:

        for (int i = 0; i < 20000000; i++)
        {
            TestObject.Finalized = false;
            MakeAndNull();
            GC.Collect();
            GC.WaitForPendingFinalizers();
            if (!TestObject.Finalized) Console.WriteLine("FAIL");
            if (!TestObject.Finalized) Console.WriteLine("FAIL2");
         }

I never see FAIL + FAIL2 printed. I always only see FAIL printed. It suggests that the problem is in GC.WaitForPendingFinalizers returning too early before the finalizer got a chance to run.

The code has a comment about such race condition and considers it acceptable:

// Anyone waiting to drain the Q can now wake up. Note that there is a
// race in that another thread starting a drain, as we leave a drain, may
// consider itself satisfied by the drain that just completed. This is
// acceptable.

#103501

This change made finalization loop more efficient. It makes any race conditions related to finalization more likely to show up.

@EgorBo
Copy link
Member

EgorBo commented Jul 1, 2024

The code has a comment about such race condition and considers it acceptable:

While it's understandable, I bet we have a ton of test cases here and there relying on WaitForPendingFinalizers calling the finalizers, effectively, synchronously (all of them are guarded by "is gc precise" flag to ignore mono). What I am saying is that we might see more flaky test failures like this then

@VSadov
Copy link
Member

VSadov commented Jul 1, 2024

I think that race goes like:

  • Some unrelated GC happens, completes, and finalization cycle starts
  • Finalization thread drains the queue and is just about to report that it is done draining, but gets preempted
  • User thread makes allocations, calls Collect.
  • Collect runs, pends more items into the queue
  • User thread tries to wait on the queue-empty event
  • The Finalization thread gets to run and signals the event (while the queue is no longer empty)

But can this happen in a completely single-threaded test? (and as I understand happens fairly reliably now)
Who causes "some unrelated GC happens"?

@jkotas
Copy link
Member

jkotas commented Jul 1, 2024

Who causes "some unrelated GC happens"?

Background event source activity. There is a ton of it on typical machine.

@VSadov
Copy link
Member

VSadov commented Jul 2, 2024

I have seen code where sequence Collect(); WaitForPendingFinalizers(); is always called twice (or three times) in a row.
My thought was always - "why not 4 or 42 times? was N enough to pass tests?" :-)

@VSadov
Copy link
Member

VSadov commented Jul 2, 2024

Perhaps the WFPF should check if the queue is empty and if not, wait on the event one more time before returning?

This API does not need to be very performant, but being less racey, if possible, could be good.

@jkotas
Copy link
Member

jkotas commented Jul 2, 2024

Can we add some flag or counter to guarantee that GC.WaitForPendingFinalizers call always waits for full turn of the crank?

@v-wenyuxu

This comment was marked as off-topic.

@jkotas

This comment was marked as off-topic.

@jkotas jkotas removed os-linux Linux OS (any supported distro) JitStress CLR JIT issues involving JIT internal stress modes arch-x64 labels Jul 2, 2024
@jkotas
Copy link
Member

jkotas commented Jul 2, 2024

@VSadov Do you plan to work on this as part of your effort to improve reliability of tests that use GC.WaitForPendingFinalizers under stress conditions?

@VSadov
Copy link
Member

VSadov commented Jul 2, 2024

I can take a look, but I think this should be a separate change.
The reliability of stress PR changes only test-specific code and would not have effects on actual behavior of apps. The bar to the changes is somewhat lower. We can iterate on that further if we do not like the effects. We would not need specific testcases for that, since it is effectively just test infra changes.

What we have here is a problem with actual behavior of the feature. Ideally the fix is just removing the race, so it is unobservable to the user code, but the change would need a bit more careful.
It may use a testcase as well. Perhaps something like #104218 (comment)

@jkotas
Copy link
Member

jkotas commented Jul 2, 2024

I can take a look, but I think this should be a separate change.

Yes, I agree it should be a separate change.

@v-wenyuxu
Copy link
Author

Failed in: runtime-coreclr libraries-jitstress 20240702.1

Failed tests:

net9.0-windows-Release-x86-tailcallstress-Windows.10.Amd64.Open
    - System.Collections.Concurrent.Tests.ConcurrentQueueTests.ReferenceTypes_NulledAfterDequeue
net9.0-windows-Release-x64-no_tiered_compilation-Windows.10.Amd64.Open
    - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest
    - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNaNTest

Error message:

 Assert.True() Failure
Expected: True
Actual:   False

Stack trace:

   at System.Collections.Concurrent.Tests.ConcurrentQueueTests.ReferenceTypes_NulledAfterDequeue()
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InterpretedInvoke_Method(Object obj, IntPtr* args) in /_/src/coreclr/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.CoreCLR.cs:line 36
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Jul 3, 2024
@mangod9 mangod9 added this to the 9.0.0 milestone Jul 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-GC-coreclr blocking-clean-ci-optional Blocking optional rolling runs
Projects
None yet
Development

No branches or pull requests

8 participants