0
Working on Fix

FlowMachine memory leak

SimonFireproof 2 months ago updated by RobD 5 hours ago 26 1 duplicate

I'm investigating a memory leak in our game and it seems to be due to FlowMachine handing around in memory and holding onto references to components. This is after loading into a new scene. I've got to the point where I've got a very simple scene in our game with one Bolt graph with a single object reference. I then load and unload that scene use the Unity Memory Profiler to see the leak. Deleting the FlowMachine fixes the memory leak.

I'm planning to continue to investigate the problem by setting up a clean project to eliminate our code base from the issue. However I wanted to post here in case anyone is able to shed light on the issue, maybe someone knows about it.

Bolt Version:
1.4.1b3
Unity Version:
2018.4.3f1
Platform(s):
Scripting Backend:
.NET Version (API Compatibility Level):

Duplicates 1

Pending Review

Hi Simon,

Thanks for the report. I'll investigate this with the memory profiler and try to see where Bolt might be holding onto a managed reference past its useful lifetime. 

Pending Review

Hi Djensen,

Thanks for the report and sorry about the issue. There has been a similar one last week, I might merge them.

Indeed, I think the cause is as you noticed keeping references in static properties. That's fortunately something very easy to fix.

I must admit I haven't yet had the chance to get familiar with the new memory profiler. I would really appreciate it if you could point out the steps you used to isolate those cases; if you don't have the time don't worry, I'll end up figuring it out, but it might just speed up the process a bit.

Thanks for getting back Lazlo!

For memory profiling, the current Unity stuff is really lacking, so we ended up using this tool: https://forum.unity.com/threads/wip-heap-explorer-memory-profiler-debugger-and-analyzer-for-unity.527949/ - It lets you see paths to GC root, all GC roots, memory snapshots, etc.

Hi djensen, thanks for following up.

That looks great, thanks for the tip. I do remember trying to do some profiling with the built in memory profiler and Unity just straight up hanging and crashing. This looks promising!

Hi Lazlo,

I recreated the memory leak with a fresh project. I used Bolt 1.4.6f3 and Unity 2018.4.3f1.

For the test, I had a simple script that had a public reference to a texture, and in the same scene a Flow Machine that had a reference to that component.

Then in another scene, a script which loaded the scene and then loaded an empty scene. I built a development build and used Unity's memory profiler to confirm the texture was still in memory and referenced by a Bolt object

Let me know if I can help further

Simon


ReferenceToTexture.cs
SceneLoadAndUnload.cs


Hi Lazlo,

Have you made any progress on this bug? It would be great to have an idea about when it will be fixed.

Thanks

Simon

Working on Fix

Hi Simon,

I'm sorry, there is no progress on the issue currently as other priorities came up.

I will give a first jab at it next week, but due to the nature of the issue, I can't guarantee it will be a quick or easy fix. I apologize for the inconvenience.

I have a fix for the leak that works from the outside using reflection to reach the culprits. It was unfortunately hurting us quite a bit, so I arrived at this solution after doing some deeper profiling/debugging sessions. Tested with Bolt 1.4.3f2. @Lazlo it might help you know what to do (At least in the version I fixed against). @Simon, feel free to see if it helps you (disclaimer: You're on a different version than me)

I preferred to clean things up incrementally in a coroutine because the collections could get quite large and I wanted to spread out the iteration a bit. Obviously you can change the timing/behavior to suit your needs if they're different.

BoltLeakFixer.txt

@djensen, thanks very much. That fixes the leak in my simple test case. It seems we're still getting a memory leak where we've used superunit scriptable objects. I'm trying to figure out how to clean them up as well.

@Simon - Our workflow is full of superunits and it fixes those leaks as long as everything relevant is destroyed/nulled, so I'm not sure if there's a version difference going on or something else

Alternatively, you can Clear() those data structures (assuming there aren't other culprits in your version) if you know that nothing is using Bolt at the time, which might be the case since you're changing scenes. The solution I shared is a little more gentle since I had to account for Bolt being used constantly as part of a service.

Either way, good luck!

Thanks for the advice. I'm actually still getting a leak in my simple test project if I use an event unit like Start() (not using superunit). I've also just tried using Clear() on those data structures. It seems like there must be more static data structures hanging about but it's hard to track down.

I'm using Bolt 1.4.6 in that though so it could be a version difference.

I think I'll wait to see how Lazlo gets on

Hi Djensen,

That might fix the leaks, but you're also getting rid of all the data for live editing and the event system every few seconds. From a quick look, that means your events should simply no longer work at all. If you were to do this only once on scene load that might be OK, but in the current state it will break Bolt in many ways from what I can tell.

Thanks for investigating the offending root pointers though, I'll keep those in mind while working on the fix.

Nothing is actually breaking with that leak fix, at least in our use case. Yes, events stop working if I, say, called Clear() while stuff was still running, but if you look closer, it's just removing empty data structures and references to destroyed objects.

I tested the fix against a server running a lot of Bolt scripts with a handful of automated clients hitting it.

Fixed (Unreleased)

Hi Djensen,

I see, I read too fast! That makes a lot of sense.

I am looking into all the static caches in the source (static readonly Dictionary or static readonly HashSet) and making sure they properly release their references. You actually had almost all of them!

  • GraphInstances.byParent: Will now remove the entry by key upon removal of an instance if the instance list is left empty by that operation, releasing the IGraphParent reference.
  • GraphInstances.byGraph: Will now remove the entry by key upon removal of an instance if the instance list is left empty by that operation, releasing the IGraph reference.
  • EventBus.events: Will now remove the entry by key upon removal of a handler if the handler list is left empty by that operation, releasing the EventHook reference, which in turn often holds a Machine reference through its target field.
  • GraphReference.interns: This one is trickier because interns don't ever get "unregistered". I added a new FreeInvalidInterns public method, and hooked it into a new ReferenceCollector static utility class. FreeInvalidInterns will thus automatically be called in Unity's SceneManager.sceneUnloaded callback, so they shouldn't pollute the GC after a scene unload. If you need more frequent cleanups, you can call FreeInvalidInterns whenever.
  • Serialization.awaitingDependers: This one should have been fine already, because the depender gets removed from the waiting list when its dependencies are met. No changes.
  • Serialization.awaitingDependencies: This one was the trickiest, because it's hard to tell when a managed object dependency goes out of scope with Unity's serialization callbacks. I reworked a lot of the dependency system so that basically only ScriptableObjects are dependencies now, which allows me to release their reference in OnDestroy. There also won't be any intermediary managed dependencies like IGraphElements anymore, so less references to keep track of.

I've yet to test all of these changes as they are pretty core, but they should fix most if not all the leaks you had been experiencing so far.

I'll report back after trying your test case with the community memory profiler (the official memory profiler just hangs on me, ugh!).

Hi Lazlo,

Thanks for the quick work. I'm happy to give it a test if you're able to send me the new version (if that helps you at all).

Awesome, Lazlo! Those changes are definitely in line with what I was finding while debugging things. Thank you for jumping on this.

Yeah, the official memory tools hang or crash the app for me. Hopefully the community one helps you validate things. Thanks again!

It's not fixed.

In my test project, you have managed to fix the case where I have a graph with an isolated reference to a component. But if I hook it on to a Start() event and Debug.Log then the leak still happens.

This zip file contains my test project. I used Unity 2018.4.3 and Bolt 1.4.7.

If you make a build and run it, it will load the scene with the Bolt graph, then load into an empty scene and the screen will go red. If you do a memory profile at that point you'll see the texture is still in memory. And you can find that it is still referenced by a managed flow machine reference.

If you do the same thing, but with the Start() node deleted, the memory leak does not happen.

BoltMemoryLeakTest.zip

Pending Review

I'll have another look.

Hi Lazlo,

Have you managed to look at this yet? It's critical for us to get this fixed soon so we need an idea of when you'll be investigating it. If that's not going to happen soon, then we'll need to try and fix it ourselves, so if you have any hints for us please let us know.

Thanks

I've been looking at this and I think I've found a few other problems. Most relate to elements in a GenericPool<> not freeing themselves properly, so you end up with objects in the free list which are still referencing objects.

Recursion needs to clear its traversedOrder and traversedCount structures when Disposed.

GraphPointer needs to clear its graphStack. It doesn't have a Dispose function, so I'm doing it in GraphStack.Dispose() (since GenericPool<GraphStack> is what's actually causing the problem).

I'm not sure those are the *right* fixes, but they do seem to solve problems for us.

We've got one other problem which is a bit trickier. It seems like if you quit a scene while a coroutine is running, the Flow never gets disposed. This is because the component that the coroutine is running on gets destroyed, so Flow.Coroutine never reaches the end of the function (incidentally, I think the "yeild break" in there should just be a yield, but that doesn't actually solve the problem since we never get there anyway). Seems like some higher level piece of code with knowledge of all of the running coroutines needs to come in and clean them up, but I'm struggling to find the best place to do that.

Any thoughts on the above would be much appreciated!

Rob.

I meant that "Yield break" should be "break".

Working on Fix

Hi Simon, Rob,

Sorry for the delay on this. It remains my current priority on Bolt 1.

Thanks for investigating further Rob. Some comments on your findings:

Recursion: 


There are three things that are wrong here. First, traversedOrder is never getting freed. Second, traversedCount is only getting freed when fetching a new recursion object from the pool, whereas it should be freed when disposing the object. Third, Reset is useless and not called anywhere.

I'll change the implementation to this:

GraphStack:

GraphPointer is a base class for both GraphStack and GraphReference. GraphReferences don't need to be freed, because they're treated as immutable objects. However, GraphStacks are created and disposed all the time, so they do need to properly release their pointers.

You say:

GraphPointer needs to clear its graphStack. It doesn't have a Dispose function, so I'm doing it in GraphStack.Dispose() (since GenericPool<GraphStack> is what's actually causing the problem).

What's puzzling me here is that it does (or at least it should) be freeing it already. Here's the implementation in GraphStack.cs:


The Dispose method calls Free on the generic pool, which in turns will call IPoolable.Free, which then clears the graph stack. Here's the relevant part in GenericPool:


Can you let me know how exactly you came to the conclusion that the graph stack wasn't being properly freed? Is it from testing with the profiler or just from an educated guess by reading the code?

The graph stack is showing up in a memory profile after exiting the level (along with availableDependencies, which I don't think your earlier fix fully solved):

(PlayTimelineAndWait is a SuperUnit that we pull into some of our graphs)

I've tried some hacks to solve the current problems, but they're causing other issues. Here's what I've tried:

I've added this function to FlowMachine:

public static void ForceClearBoltData()
{
   //HACKHACKHACK
   Ludiq.GenericPool<Ludiq.GraphStack>.Clear();
   Ludiq.Serialization.awaitingDependers.Clear();
   Ludiq.Serialization.availableDependencies.Clear();
}

Between levels I load an empty scene and then call that (followed by Resource.UnloadUnused)

For the problem with Flows not being disposed if they're in a coroutine when you quit the level, I've changed Flow.StopCoroutine to this:

public void StopCoroutine()
{
   // if (!isCoroutine)
   // {
   //     throw new NotSupportedException("Stop may only be called on coroutines.");
   // }
   //
   // coroutineStopRequested = true;
   if (stack.component != null)
   {
      stack.component.StopAllCoroutines();
      Dispose();
   }
}

This is also a hack - calling it on one flow will stop the coroutines for all of the flows in that FlowMachine. I don't care too much about that though, and it seems to solve the problem (although I had to refactor the calling code in EventUnit a little).

These things combined does seem to solve clear out all of the references, but at the moment it looks like it's preventing events firing properly in the next level. 

This is becoming a major issue for us. Have you had any luck with proper solutions?

If its easier, a brute-force re-initialise function that we can call between levels would be OK for now (similar to my ForceClearBoltData above, but more thorough). 

I'll keep trying to solve this, but we could really use some help from you guys. The code is pretty hard to visualise with all of the templates and interfaces, so we're struggling to see how things are intended to work. Any pointers or solutions you could give us would be hugely appreciated!

Rob.

Hi Rob,

Thanks for the continued follow-up. I fully understand that this is a major issue for you, however I must be careful making core changes to fix it as it can easily break higher level behaviour (as you saw with coroutines).

Going forward, I think it's important we all set up a common set of tests and reproduction cases that we share so that we can actually look at the same data and fixes.

Here's my suggestion for the test environment: 

Test environment:

  • Unity 2018.4.x
  • Bolt v.1.4.7
  • Heap Explorer Beta v.3.1

Please send me step-by-step test cases that cause a leak OR a package file with the test case already setup. If you send me steps, I'll set up the package myself and upload it here so we're all on the same page.

Yeah, that all makes sense. I can't put together a test case right now (it's the end of the day here), but Simon's test from earlier is still valid, and I *think* that if you just set up a graph with a Start event as a coroutine, linked to a WaitForSeconds(100) and then a Debug.Log or similar and exit the scene while that's running, you'll see that the FlowMachine still exists when the scene has exited. Otherwise I'll try to put together some more thorough tests early next week.