In debugging a mysteriously hanging Go function recently, I learned something new about how to use WaitGroups and Channels to synchronize Go routines. Keep reading to learn more!
As a relatively new Go programmer, I was stumped recently by a bug that caused a function running multiple Go routines to hang indefinitely. After taking a deeper dive into the usage of WaitGroups and Channels to manage the behavior of Go routines, I finally understood the right way to leverage these tools and was able to resolve the bug.
In this post we'll
- Examine the buggy code and break down why it doesn't work
- Walk through the right way to synchronize Go routines with a simple example
- Fix our bug!
The Blocking Bug
In an application responsible for displaying GitHub usage analytics, we have a function,
MergeContributors, that is responsible for taking two GitHub user accounts that belong to the same person and "merging" them into one so that the UI can accurately reflect their contributions to project commits, pull requests and merges. This "contributor merge" entails the following:
- Update all of the analytic engine's stored Git commits so that all the user's commits from both accounts are correctly associated to one primary account
- Update all of the analytic engine's stored pull requests so that all the user's PRs from both accounts are correctly associated to one primary account
- Update all of the analytic engine's stored PR merges so that all the user's merges from both accounts are correctly associated to one primary account
- Finally, if all of those actions succeeded, update a record indicating that the two accounts have been "merged"
The function was bugging out in the following way:
- If an error occurred in any of the first three steps
- Then the function would block, hanging forever and never returning
Yikes! Let's take a look at the original code so that we can diagnose the issue:
Breaking Down The Bug
Let's walk through what's happening here:
- We establish a WaitGroup,
waitGroupand a channel that expects to receive errors
- We increment the WaitGroup count to three, since we will be using it to orchestrate our three synchronous DB transaction Go routines
- We spin off a separate Go routine that will block until the
waitGroup's count is back down to zero. Once it unblocks, it will close the channel.
We spin off our three concurrent Go routines, one for each database transaction. Each Go routine runs a function that is responsible for making some database call, sending an error to the channel if necessary, and decrementing the
waitGroup's count before the function returns.
Then, we have a call to
waitGroup.Wait(). This will block the execution of the main function until the
waitGroup's count is back down to zero.
After this blocking call, we're using
range to iterate over the messages sent to the channel. The call to
range will continue listening for messages to the channel until the channel is closed. This is a blocking operation.
Once the channel is closed by our earlier Go routine, the one that is waiting for the
waitGroup count to get down to zero and then calling
close(c), the call to
range will stop listening to the channel and the main function will proceed to run, either returning an error if one was received by the channel or moving on to the last piece of work, the call to
Have you spotted the problem yet?
Understanding The Problem
In order to spot the bug, we need to understand something about how sending and receiving messages over a channel works. When we send a message over a channel, that call to send the message is blocking until the message is read from the channel.
Taking a closer look at one of our DB transaction Go routines:
We can understand that the line in which we send a message to the channel,
c <- error, will block the execution of the anonymous function running in our Go routine until that message is read.
Where are we reading messages? Via our
range call, which comes after a call to
Here's the problem:
- The anonymous function sends a message to the channel, which blocks until the message is read
- The message-reading code, our
rangecall, won't run until after the
waitGroupcount is back down to zero, because it comes after a synchronous call to
- Since the message can't be read yet, the anonymous function running in our go routine can't finish running--it won't call the deferred
- This means that the
waitGroupcount will never get back down to zero, which in turn means the call to
waitGroup.Wait()that comes right before our
rangecall will never unblock.
waitGroup.Wait()will never unblock, we can't call
range, the message will never be read from the channel, and we're back where we started--in an infinite block!
The Right Way To Synchronize Go Routines
In order prevent this block, we need to ensure that our
range call will run and read messages from the channel while the Go routines are running. This will ensure that any Go routine that sends a message to a channel will not block on that send, thereby allowing the Go routine's anonymous function to call
Let's take a look at a simple example:
Let's break this down. We:
- Establish a WaitGroup and set its count to three
- Establish a channel that can receive messages that are strings
- Spin off a Go routine that will wait until the
waitGroup's count is zero, then close the channel
- Create three separate Go routines, each of which will write a message to the channel and, once that message is read, decrement the
- Then, at the same time as the running of these Go routines, we are range-ing over the channel, reading any incoming messages and printing them to STDOUT.
- Since our range call is running while the Go routines are running, the messages each routine sends to the channel are read immediately. The calls in each routine to send the message therefore do not block for long, and each routine's anonymous function is able to invoke the
- Once each of the three Go routine concludes, having decremented the
waitGroupcount, the first Go routine (the one that calls
waitGroup.Wait()) will unblock and close the channel.
- Once the channel is closed and all its messages are read, the range will stop listening for messages and the main function will finish running.
If we run the code above, we'll see that the function doesn't block improperly. Instead, we see the following output successfully printed:
Now that we understand the right way to synchronize our Go routines, let's fix our bug!
Fixing the Bug
We need to get rid of that second, synchronous call to
waitGroup.Wait(). This call is preventing messages from getting read from the channel, which in turn prevents any calls to
waitGroup.Done(). This is the cause of the block in our function.
If we remove the offending line, we're left with:
Now, we're ensuring the following behavior:
- Run a Go routine that is blocking, via a call to
waitGroup.Wait(), until the WaitGroup count is down to zero, at which point it closes the channel
- Each "DB transaction" Go routine's anonymous function will, if there is an error, send a message to the channel
- The call to
rangeis listening for such messages, and reading them as they arrive
- Back in each "DB transaction" Go routine, the function is un-blocked, and the call to
waitGroup.Done()will run, decrementing the WaitGroup's count
- Once the WaitGroup's count hits zero, the first Go routine will un-block and close the channel via a call to
- This will tell the
rangecall to stop listening to the channel, and therefore stop blocking, allowing the main function to continue execution
What havoc one misplaced
waitGroup.Wait() can wreak! The key takeaway here is that when you send a message to a channel, it will block until that message is read. We need to ensure that we're reading from any channels we're writing to, in order to successfully synchronize our Go routines.
In the case of our bug, we were improperly blocking the execution of our function, preventing any messages from getting read from the channel. By removing our extra
waitGroup.Wait(), we guaranteed that messages sent the channel would be read, allowing the
waitGroup's counter to be decremented. This in turn ensured that the channel would be closed, causing the range over that channel's messages to unblock, and allowing the main function to continue executing.
This bug certainly gave me a better understanding of Go routine synchronization, and I hope it was helpful to you too.