As many of you know, we’ve spent the last two weeks or so focusing on bug fixing and performance work in advance of Alpha 9’s release. For this week’s dev update, I thought I would do something a little different and share the horror story of a particularly nasty bug, which quickly became infamous as the “Zombie Worker” bug.
What follows is the account of Team Radiant member Albert, who investigated and ultimately fixed the bug.
The “Zombie Worker” Bug
A few weeks ago we started to get an increasing number of reports of zombie workers just standing around doing nothing. We’ve seen this occur when building a large town with many workers and zones (particularly stockpiles and mining zones) that are far apart. As a result, we’ve been optimizing the pathfinding and improving the AI system so that it can scale to handle lots of difficult tasks.
The crazy thing about this bug report was that the pathfinder and AI systems seemed completely idle. We knew this because of the little performance bar that we put in the lower right of the game screen. Check out this screenshot from a player who reported the Zombie Worker bug:
Everyone is standing around doing nothing, and will eventually starve to death. Awesome. The crazy part is, look at the little bar in the lower right. It’s completely green, which means the system is idle! If the AI or pathfinder were going bonkers, that bar would be all yellow or orange.
So, based on everything we understand about Stonehearth, this is a completely impossible situation. Hmmm…
Mission: Repro the Bug
The first step in improving the system is to reproduce the problem so that we have a good understanding of the root cause, and so that we can address the underlying problem as opposed to just covering up the symptoms. None of us had seen these zombie towns first hand, so we set about trying to reproduce the problem.
Speculating that this was a result of the AI running out of time and just giving up, we played the game in ways that would stress out the AI system. We saved and loaded the game countless times. Tom even busted out our slowest and oldest computers and played for many game days, but the zombies were nowhere to be seen. Reports of the zombie plague seemed common among our players, but everybody at Radiant seemed immune to it. What were we missing?
Stonehearth Tester Army to the Rescue
It turns out that quite a few of you are excellent sleuths. Through many playthroughs, Vince5754 on the forums started to see a pattern:
- Play the game.
- Go to sleep.
- Overnight, all of your hearthlings have turned into zombies.
Seriously, we couldn’t have invented a more imaginative bug. What in the world was creeping into everybody’s computer at night and infecting all their hearthlings? After more investigation, the second breakthrough came when Tempered posted that the bug always happens after rebooting: upon loading a saved game, every hearthling has become a zombie.
With that, we were able to reproduce the bug and look into what was going on. After a bit of debugging, we saw that our real-time timers were behaving oddly after rebooting. First, a bit of background. In Stonehearth, we use two types of timers: calendar timers and real-time timers. Calendar timers keep track of game time and control events like when to eat and how fast crops grow. Real-time timers keep track of real-life time that we humans live in and are used for things like measuring performance, capturing player input, and… scheduling worker tasks.
So, what was happening? All timers restore their state (the information they need to operate) on load and usually everything just works from there. In this case, the timer’s own state was restored properly, but the real-time clock that we read from Windows resets to zero on reboot.
So basically, the timer for task scheduling says, “The Windows real-time timer currently reads 10000. Wake me up when it reaches 10001.” We now save the game and reboot. When Windows boots up again, the real-time timer is reset to 0, but the task scheduler is still waiting for it to reach 10001, which takes a really long time. As a result, no new tasks are assigned to any hearthlings and they stand around like zombies with nothing to do.
If you wait around long enough for the timer to reach 10001, the problem fixes itself. But of course, in almost every case your hearthlings will starve to death first. The good news is that once the bug was identified, it was pretty straightforward and took only a matter of hours to fix.
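To make the failure concrete, here is a minimal Python sketch of the scenario described above. It is not Stonehearth’s actual code: `FakeUptimeClock` and `NaiveTimer` are hypothetical names invented for illustration, and the only assumption carried over from the story is that the timer saves an absolute deadline measured against a clock that restarts at zero on reboot.

```python
class FakeUptimeClock:
    """Hypothetical stand-in for an OS counter that restarts at zero on boot."""

    def __init__(self):
        self.ticks = 0

    def advance(self, ticks):
        self.ticks += ticks

    def reboot(self):
        # The counter starts over, just like the real-time timer in the story.
        self.ticks = 0


class NaiveTimer:
    """Hypothetical timer that saves an absolute deadline in clock ticks."""

    def __init__(self, clock, delay):
        self.clock = clock
        self.deadline = clock.ticks + delay

    def save(self):
        # Only the absolute deadline is written to the save file.
        return {"deadline": self.deadline}

    @classmethod
    def load(cls, clock, state):
        timer = cls.__new__(cls)
        timer.clock = clock
        timer.deadline = state["deadline"]  # restored "properly", but stale
        return timer

    def ticks_until_fire(self):
        return max(0, self.deadline - self.clock.ticks)


clock = FakeUptimeClock()
clock.advance(10000)                 # the clock currently reads 10000
timer = NaiveTimer(clock, delay=1)   # "wake me up when it reaches 10001"
saved = timer.save()

clock.reboot()                       # the clock is back at 0
restored = NaiveTimer.load(clock, saved)
print(restored.ticks_until_fire())   # 10001 ticks to wait instead of 1
```

The timer restores exactly what it saved, which is why nothing looked broken in the timer itself; the value only stops making sense because the clock it refers to has started over.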
Thanks to the hard work from our players in identifying this bug, our real-time timer is now restored properly on load and this bug is now squashed!
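For the curious, one common way to make a timer like this survive a reboot is to save how much time is left rather than the absolute deadline. The Python sketch below shows that idea using the 10000/10001 numbers from the example above. It is only an illustration under that assumption, not a description of the change that actually shipped, and `FakeUptimeClock` and `RelativeTimer` are hypothetical names.

```python
class FakeUptimeClock:
    """Hypothetical stand-in for an OS counter that restarts at zero on boot."""

    def __init__(self):
        self.ticks = 0

    def advance(self, ticks):
        self.ticks += ticks

    def reboot(self):
        self.ticks = 0


class RelativeTimer:
    """Hypothetical timer that saves the remaining time, not the deadline."""

    def __init__(self, clock, delay):
        self.clock = clock
        self.deadline = clock.ticks + delay

    def save(self):
        # Persist how long is left; this value does not depend on the clock's origin.
        return {"remaining": max(0, self.deadline - self.clock.ticks)}

    @classmethod
    def load(cls, clock, state):
        # Rebuild the deadline against whatever the clock reads right now.
        return cls(clock, state["remaining"])

    def ticks_until_fire(self):
        return max(0, self.deadline - self.clock.ticks)


clock = FakeUptimeClock()
clock.advance(10000)
timer = RelativeTimer(clock, delay=1)
saved = timer.save()

clock.reboot()
restored = RelativeTimer.load(clock, saved)
print(restored.ticks_until_fire())   # 1 tick to wait, same as before the reboot
```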