+1
Fixed

FlowMachine memory leak

SimonFireproof 1 year ago updated by Lazlo Bonin (Lead Developer) 1 year ago 55 1 duplicate

I'm investigating a memory leak in our game and it seems to be due to FlowMachine handing around in memory and holding onto references to components. This is after loading into a new scene. I've got to the point where I've got a very simple scene in our game with one Bolt graph with a single object reference. I then load and unload that scene use the Unity Memory Profiler to see the leak. Deleting the FlowMachine fixes the memory leak.

I'm planning to continue to investigate the problem by setting up a clean project to eliminate our code base from the issue. However I wanted to post here in case anyone is able to shed light on the issue, maybe someone knows about it.

Bolt Version:
1.4.1b3
Unity Version:
2018.4.3f1
Platform(s):
Scripting Backend:
.NET Version (API Compatibility Level):

Duplicates 1

Pending Review

Hi Simon,

Thanks for the report. I'll investigate this with the memory profiler and try to see where Bolt might be holding onto a managed reference past its useful lifetime. 

Pending Review

Hi Djensen,

Thanks for the report and sorry about the issue. There has been a similar one last week, I might merge them.

Indeed, I think the cause is as you noticed keeping references in static properties. That's fortunately something very easy to fix.

I must admit I haven't yet had the chance to get familiar with the new memory profiler. I would really appreciate it if you could point out the steps you used to isolate those cases; if you don't have the time don't worry, I'll end up figuring it out, but it might just speed up the process a bit.

Thanks for getting back Lazlo!

For memory profiling, the current Unity stuff is really lacking, so we ended up using this tool: https://forum.unity.com/threads/wip-heap-explorer-memory-profiler-debugger-and-analyzer-for-unity.527949/ - It lets you see paths to GC root, all GC roots, memory snapshots, etc.

Hi djensen, thanks for following up.

That looks great, thanks for the tip. I do remember trying to do some profiling with the built in memory profiler and Unity just straight up hanging and crashing. This looks promising!

Hi Lazlo,

I recreated the memory leak with a fresh project. I used Bolt 1.4.6f3 and Unity 2018.4.3f1.

For the test, I had a simple script that had a public reference to a texture, and in the same scene a Flow Machine that had a reference to that component.

Then in another scene, a script which loaded the scene and then loaded an empty scene. I built a development build and used Unity's memory profiler to confirm the texture was still in memory and referenced by a Bolt object

Let me know if I can help further

Simon


ReferenceToTexture.cs
SceneLoadAndUnload.cs


Hi Lazlo,

Have you made any progress on this bug? It would be great to have an idea about when it will be fixed.

Thanks

Simon

Working on Fix

Hi Simon,

I'm sorry, there is no progress on the issue currently as other priorities came up.

I will give a first jab at it next week, but due to the nature of the issue, I can't guarantee it will be a quick or easy fix. I apologize for the inconvenience.

I have a fix for the leak that works from the outside using reflection to reach the culprits. It was unfortunately hurting us quite a bit, so I arrived at this solution after doing some deeper profiling/debugging sessions. Tested with Bolt 1.4.3f2. @Lazlo it might help you know what to do (At least in the version I fixed against). @Simon, feel free to see if it helps you (disclaimer: You're on a different version than me)

I preferred to clean things up incrementally in a coroutine because the collections could get quite large and I wanted to spread out the iteration a bit. Obviously you can change the timing/behavior to suit your needs if they're different.

BoltLeakFixer.txt

@djensen, thanks very much. That fixes the leak in my simple test case. It seems we're still getting a memory leak where we've used superunit scriptable objects. I'm trying to figure out how to clean them up as well.

@Simon - Our workflow is full of superunits and it fixes those leaks as long as everything relevant is destroyed/nulled, so I'm not sure if there's a version difference going on or something else

Alternatively, you can Clear() those data structures (assuming there aren't other culprits in your version) if you know that nothing is using Bolt at the time, which might be the case since you're changing scenes. The solution I shared is a little more gentle since I had to account for Bolt being used constantly as part of a service.

Either way, good luck!

Thanks for the advice. I'm actually still getting a leak in my simple test project if I use an event unit like Start() (not using superunit). I've also just tried using Clear() on those data structures. It seems like there must be more static data structures hanging about but it's hard to track down.

I'm using Bolt 1.4.6 in that though so it could be a version difference.

I think I'll wait to see how Lazlo gets on

Hi Djensen,

That might fix the leaks, but you're also getting rid of all the data for live editing and the event system every few seconds. From a quick look, that means your events should simply no longer work at all. If you were to do this only once on scene load that might be OK, but in the current state it will break Bolt in many ways from what I can tell.

Thanks for investigating the offending root pointers though, I'll keep those in mind while working on the fix.

Nothing is actually breaking with that leak fix, at least in our use case. Yes, events stop working if I, say, called Clear() while stuff was still running, but if you look closer, it's just removing empty data structures and references to destroyed objects.

I tested the fix against a server running a lot of Bolt scripts with a handful of automated clients hitting it.

Fixed (Unreleased)

Hi Djensen,

I see, I read too fast! That makes a lot of sense.

I am looking into all the static caches in the source (static readonly Dictionary or static readonly HashSet) and making sure they properly release their references. You actually had almost all of them!

  • GraphInstances.byParent: Will now remove the entry by key upon removal of an instance if the instance list is left empty by that operation, releasing the IGraphParent reference.
  • GraphInstances.byGraph: Will now remove the entry by key upon removal of an instance if the instance list is left empty by that operation, releasing the IGraph reference.
  • EventBus.events: Will now remove the entry by key upon removal of a handler if the handler list is left empty by that operation, releasing the EventHook reference, which in turn often holds a Machine reference through its target field.
  • GraphReference.interns: This one is trickier because interns don't ever get "unregistered". I added a new FreeInvalidInterns public method, and hooked it into a new ReferenceCollector static utility class. FreeInvalidInterns will thus automatically be called in Unity's SceneManager.sceneUnloaded callback, so they shouldn't pollute the GC after a scene unload. If you need more frequent cleanups, you can call FreeInvalidInterns whenever.
  • Serialization.awaitingDependers: This one should have been fine already, because the depender gets removed from the waiting list when its dependencies are met. No changes.
  • Serialization.awaitingDependencies: This one was the trickiest, because it's hard to tell when a managed object dependency goes out of scope with Unity's serialization callbacks. I reworked a lot of the dependency system so that basically only ScriptableObjects are dependencies now, which allows me to release their reference in OnDestroy. There also won't be any intermediary managed dependencies like IGraphElements anymore, so less references to keep track of.

I've yet to test all of these changes as they are pretty core, but they should fix most if not all the leaks you had been experiencing so far.

I'll report back after trying your test case with the community memory profiler (the official memory profiler just hangs on me, ugh!).

Hi Lazlo,

Thanks for the quick work. I'm happy to give it a test if you're able to send me the new version (if that helps you at all).

Awesome, Lazlo! Those changes are definitely in line with what I was finding while debugging things. Thank you for jumping on this.

Yeah, the official memory tools hang or crash the app for me. Hopefully the community one helps you validate things. Thanks again!

It's not fixed.

In my test project, you have managed to fix the case where I have a graph with an isolated reference to a component. But if I hook it on to a Start() event and Debug.Log then the leak still happens.

This zip file contains my test project. I used Unity 2018.4.3 and Bolt 1.4.7.

If you make a build and run it, it will load the scene with the Bolt graph, then load into an empty scene and the screen will go red. If you do a memory profile at that point you'll see the texture is still in memory. And you can find that it is still referenced by a managed flow machine reference.

If you do the same thing, but with the Start() node deleted, the memory leak does not happen.

BoltMemoryLeakTest.zip

Pending Review

I'll have another look.

Hi Lazlo,

Have you managed to look at this yet? It's critical for us to get this fixed soon so we need an idea of when you'll be investigating it. If that's not going to happen soon, then we'll need to try and fix it ourselves, so if you have any hints for us please let us know.

Thanks

I've been looking at this and I think I've found a few other problems. Most relate to elements in a GenericPool<> not freeing themselves properly, so you end up with objects in the free list which are still referencing objects.

Recursion needs to clear its traversedOrder and traversedCount structures when Disposed.

GraphPointer needs to clear its graphStack. It doesn't have a Dispose function, so I'm doing it in GraphStack.Dispose() (since GenericPool<GraphStack> is what's actually causing the problem).

I'm not sure those are the *right* fixes, but they do seem to solve problems for us.

We've got one other problem which is a bit trickier. It seems like if you quit a scene while a coroutine is running, the Flow never gets disposed. This is because the component that the coroutine is running on gets destroyed, so Flow.Coroutine never reaches the end of the function (incidentally, I think the "yeild break" in there should just be a yield, but that doesn't actually solve the problem since we never get there anyway). Seems like some higher level piece of code with knowledge of all of the running coroutines needs to come in and clean them up, but I'm struggling to find the best place to do that.

Any thoughts on the above would be much appreciated!

Rob.

I meant that "Yield break" should be "break".

Working on Fix

Hi Simon, Rob,

Sorry for the delay on this. It remains my current priority on Bolt 1.

Thanks for investigating further Rob. Some comments on your findings:

Recursion: 


There are three things that are wrong here. First, traversedOrder is never getting freed. Second, traversedCount is only getting freed when fetching a new recursion object from the pool, whereas it should be freed when disposing the object. Third, Reset is useless and not called anywhere.

I'll change the implementation to this:

GraphStack:

GraphPointer is a base class for both GraphStack and GraphReference. GraphReferences don't need to be freed, because they're treated as immutable objects. However, GraphStacks are created and disposed all the time, so they do need to properly release their pointers.

You say:

GraphPointer needs to clear its graphStack. It doesn't have a Dispose function, so I'm doing it in GraphStack.Dispose() (since GenericPool<GraphStack> is what's actually causing the problem).

What's puzzling me here is that it does (or at least it should) be freeing it already. Here's the implementation in GraphStack.cs:


The Dispose method calls Free on the generic pool, which in turns will call IPoolable.Free, which then clears the graph stack. Here's the relevant part in GenericPool:


Can you let me know how exactly you came to the conclusion that the graph stack wasn't being properly freed? Is it from testing with the profiler or just from an educated guess by reading the code?

The graph stack is showing up in a memory profile after exiting the level (along with availableDependencies, which I don't think your earlier fix fully solved):

(PlayTimelineAndWait is a SuperUnit that we pull into some of our graphs)

I've tried some hacks to solve the current problems, but they're causing other issues. Here's what I've tried:

I've added this function to FlowMachine:

public static void ForceClearBoltData()
{
   //HACKHACKHACK
   Ludiq.GenericPool<Ludiq.GraphStack>.Clear();
   Ludiq.Serialization.awaitingDependers.Clear();
   Ludiq.Serialization.availableDependencies.Clear();
}

Between levels I load an empty scene and then call that (followed by Resource.UnloadUnused)

For the problem with Flows not being disposed if they're in a coroutine when you quit the level, I've changed Flow.StopCoroutine to this:

public void StopCoroutine()
{
   // if (!isCoroutine)
   // {
   //     throw new NotSupportedException("Stop may only be called on coroutines.");
   // }
   //
   // coroutineStopRequested = true;
   if (stack.component != null)
   {
      stack.component.StopAllCoroutines();
      Dispose();
   }
}

This is also a hack - calling it on one flow will stop the coroutines for all of the flows in that FlowMachine. I don't care too much about that though, and it seems to solve the problem (although I had to refactor the calling code in EventUnit a little).

These things combined does seem to solve clear out all of the references, but at the moment it looks like it's preventing events firing properly in the next level. 

This is becoming a major issue for us. Have you had any luck with proper solutions?

If its easier, a brute-force re-initialise function that we can call between levels would be OK for now (similar to my ForceClearBoltData above, but more thorough). 

I'll keep trying to solve this, but we could really use some help from you guys. The code is pretty hard to visualise with all of the templates and interfaces, so we're struggling to see how things are intended to work. Any pointers or solutions you could give us would be hugely appreciated!

Rob.

Hi Rob,

Thanks for the continued follow-up. I fully understand that this is a major issue for you, however I must be careful making core changes to fix it as it can easily break higher level behaviour (as you saw with coroutines).

Going forward, I think it's important we all set up a common set of tests and reproduction cases that we share so that we can actually look at the same data and fixes.

Here's my suggestion for the test environment: 

Test environment:

  • Unity 2018.4.x
  • Bolt v.1.4.7
  • Heap Explorer Beta v.3.1

Please send me step-by-step test cases that cause a leak OR a package file with the test case already setup. If you send me steps, I'll set up the package myself and upload it here so we're all on the same page.

Yeah, that all makes sense. I can't put together a test case right now (it's the end of the day here), but Simon's test from earlier is still valid, and I *think* that if you just set up a graph with a Start event as a coroutine, linked to a WaitForSeconds(100) and then a Debug.Log or similar and exit the scene while that's running, you'll see that the FlowMachine still exists when the scene has exited. Otherwise I'll try to put together some more thorough tests early next week.

Hi Lazlo,

I've made a minimal repro case that I think shows all of the leaks that I mentioned. This is using vanilla 1.4.7 (without my changes) and Unity 2018.4.3f1.

BoltMemoryLeak2.zip

Unzip the file, and open it in Unity (you'll need to install Bolt - I didn't include that in the package).

There's 3 scenes: 

StartScene just shows a green screen, waits two seconds and then loads the Bolt scene (using the "WaitAndChangeScene" component on Main Camera

BoltScene has a single embedded Bolt graph that just does this:


It also has a WaitAndChangeScene component on the Main Camera that waits 2 seconds (long enough for the Bolt coroutine to start but not finish) and then loads EndScene

EndScene does nothing except show "Finished" on screen.

Build the project (in Development mode), run it, and wait for it to show finished. Then connect HeapExplorer to it (you'll need to open the normal profiler first to select WindowsPlayer as the target).

In Heap Explorer, select "Investigate" in the "Top 20 Managed Memory Usage", then put "FlowMachine" in the search bar. There should be one entry. (I mean, there *should* be no entries, but there will be one):

Select it and you should see the paths to root in the bottom right:

The RecursionNode one is the one that you already posted a solution for. The other two are currently causing the problems for us.

Hope that helps!

Rob.

Hi Rob,

Thanks for the very thorough reproduction project (if only all my bug reports were like these!).

I'll get to testing it now and report back with my findings.

EDIT 1: Reproduction works, I do get the exact same paths to root.

EDIT 2: The Recursion fix did not remove the 3rd leak, likely because the flow just doesn't get disposed in the coroutine (hence the other 2 leaks as well).

EDIT 3: The issue indeed seems to be that Dispose() doesn't get called in Flow.Coroutine() because the coroutine no longer runs to its end when its runner object is destroyed, and therefore the last MoveNext() never makes gets called.

EDIT 4: This should normally be solvable by putting Dispose() in a try/finally block, because enumerator consumers should always dispose when they're done. However, it seems like Unity's coroutine implementation doesn't, which is really annoying. 

EDIT 5: Starting to reimplement our coroutine strategy, but I have to make sure not to break the fix from this reported issue.

EDIT 6: Success! ;) Will continue testing to make sure I didn't break any other coroutine functionality.

Awesome! If you want to throw us a diff of the files or replacement dlls, I'd be happy to test them out and see if they solve our problems (or cause any other issues)

Also, it's probably worth checking if this also fixes Simon's original reproduction project (again, I'm happy for us to do that ourselves if it's easier)

I also retested Simon's case and I can no longer see the Cat texture reference in the native memory.

+1

Thanks! We'll give it a go in the morning and let you know how it looks. 

Getting closer!

We've still got a problem with SuperUnits. I've updated the test case from yesterday. There's now a super unit called "PlayTimelineAndWait". Calling that from Start and quitting the level while its still playing gives the following leak:

(We do get an extra leak to do with Serialization in our actual game, but I can't get it to happen in the test project. I'm still investigating if there's something that we're doing wrong there). Edit: Fixed that one now. It was our fault.

Interestingly, I also made a SuperUnit called WaitFor20Seconds. Calling that from Start *doesn't* cause the leak. I'm not sure what the difference is.

Here's the updated project: BoltMemoryLeak2.zip

Any thoughts?

Hi Rob,

Thanks for the follow up. I found the cause of that one and seemingly fixed it. Can you test the following patch and let me know?

0003-Fix-memory-leak-when-coroutine-is-in....patch

Progress! The FlowMachine is now destroyed correctly, but the FlowMacro, and anything it links to, is not. 

Using the same test project as yesterday, if at the end you search for PlayableDirector you'll find that it still exists. The problem seems to be that OnDestroy is never called on the Macro, so it never removes itself from the list of available dependencies in Serialization. Here's the path to root for the PlayableDirector:

Hi RobD,

The macro shouldn't get destroyed as it's an asset file and may be reused later.

I'll have to check why it holds on to a reference from the scene. I have a hunch that the Member class keeps its target object instead of using a parameter for it. I think we fixed it in Bolt 2 already but didn't backport it.

Hi Lazlo,

Thanks for the clarification. Is there an ETA on when you might be able to take a look? (This is the last remaining issue that's holding us back from having our Alpha approved, so we're keen to get a resolution!). 

If you won't be able to look at it in the next couple of days, could you give me some pointers on how I can go about fixing it?

Thanks,

Rob.

Hi Rob,

On it now, except this change is more significant than the others, as it changes the core class we use for reflection (Member).

Awesome. Good luck!

Hi Rob,

Can you try this patch and let me know if it fixes everything?

(Should be applied after all 3 previous patches)

0001-Fixed-memory-leak-due-to-Member-hold....patch

Thanks! Looks good in the test project. I'm trying it in our main project now.

One immediate problem is that AOT pre-builds now fail:

(The top line of that got cut off - it's a null reference.)

Update: Looks like it's still leaking:

I'll try to update the test project to reproduce this.

Update 2: I've reproduced it, albeit with significantly less paths to root than in our game (it's only one in the test). Here's an updated test project:BoltMemoryLeak3.zip

To see the leak, in Heap Explorer open the "Investigate" section of the top 20 native memory usage, and search for Cat. It seems to be caused by the direct reference to the texture inside the bolt script, where it uses it to set a material's texture. Just having it indirectly referenced by the Playable Director no longer causes a leak. Here's the path to root that I get from the test project:

Wow that leak is persistent! I was sure this was the last of it.

I'll give it another round... I don't see the direct cause for the last leak from the get go, I'll have to investigate.

EDIT: OK, the AOT stubs error was just a minor refactoring typo.

Ok, looking at the case more closely, it's even trickier than the others...

A few considerations:

  • The macro is a ScriptableObject asset that was loaded in native memory
  • It references another Texture2D asset that it logically has to also load in order to assign the texture
  • This is made clear by the Heap Explorer, as the reference path is a traversed through a blue native C++ reference, not a green managed C# reference like before
  • The full path to root includes the managed serialization dependencies, but that's because the FlowMacro only unregisters from there during OnDestroy, which is never getting called.
  • There are two reasons I think it might not get called:
    • Unity doesn't automatically unused unload assets on scene change (not sure)
    • AND/OR Even if you call Resources.UnloadUnusedAssets manually, because that method checks for static variables too, then the macro shows up as still used, because the serialization variable has it.

It's kind of a chicken and egg: Bolt waits for OnDisable/OnDestroy to free the macro reference from the serialization dependency cache, and that comes from Resources.UnloadUnusedAssets, but Unity won't unload it because the macro reference is inside the serialization dependency cache.

In the docs:

An asset is deemed to be unused if it isn't reached after walking the whole game object hierarchy, including script components. Static variables are also examined.

This Unity Answers post suggests that using WeakReferences might be honored by Unity, so I'll try that.

OK, that's interesting. The example is pretty contrived, so I'm not 100% sure that it's the same cause as the examples in our game. It's obviously still a problem in general, but I might do some more tests and try to find a reproduction that more closely matches our in-game usage.

Incidentally, as a short-term measure, would loading an empty scene then forcibly clearing the dependency cache cause any problems? At this point I'd settle for a safe-but-hacky solution!

Clearing the dependency cache could cause objects that require a dependency to never deserialize, even though that dependency had been deserialized already. Now that might not be common in cross-scene scenarios, but it's kind of hard to anticipate at this low a level.

But if you want to try it, your helper should do:

availableDependencies.Clear(); // Risky?
Resources.UnloadUnusedAssets(); // Should catch unused macros now that the cache is clear

Yeah, that does completely break the game. Loading the same level again after clearing the dependencies causes Bolt silently fail (presumably waiting for its dependencies to be available)

Worryingly, it also doesn't cure the leak. It reduces the paths to root from 4 to 3:

My other backup plan is to just replace all of our macros with code. I think it'd be pretty easy to write a node equivalent to our PlayTimelineAndWait (and a couple of other similar ones), based on the WaitUnit code. 

(Obviously that doesn't solve your problem in general, but it might take the pressure off a bit!)

Don't give up yet ;) I'm testing a fix and it's promising.

Cool :-) 

I'm off now, but if there's anything new on Monday morning I'll give it a go then and let you know.

Well, it seems to work. Here's the patch:

0001-Fix-memory-leak-from-macros-not-unlo....patch

Here's the general strategy if you're curious:

Serialization.cs:

Macro.cs:

Hi Lazlo,

Seems promising! It's definitely fixed all of the known Bolt leaks. There's still a bit of confusion on our end as we're still leaking memory in general and Bolt is still showing up in some of the root paths, but I *think* that's down to problems in our code rather than yours!

We're investigating now. I'll let you know what we find.
Thanks!

Rob.

Fixed (Unreleased)

Hi Rob,

Glad to hear that. I'll push the fix as part of Bolt v.1.4.8 today as the issue has been open for over a month now. Feel free to open a new issue if you find any more leak.