I’ve had a long-standing goal of getting a bot to juggle three balls using reinforcement learning. It’s such an interesting problem from an AI perspective (and I really enjoy juggling), so it’s something I’ve wanted to get off the ground for a while. I’ve had a few failed attempts in the past, but I’m definitely closer this time around. I’ve got a system roughly set up in Unity using ML-Agents, Unity’s own reinforcement learning toolkit made by their developers. This is definitely an early update, so my results aren’t going to look good yet, but I’m happy with the setup so far at least.
Before we get into a more advanced reinforcement learning setup, let’s look at the naive approach and why it doesn’t work well. At first glance, the problem seems very simple from a rewards/punishment perspective: give the agent a small reward on every frame that a ball isn’t dropped, and a large punishment when a ball hits the ground. With enough time and computing power on the right algorithm, this sort of system might eventually converge on a policy that can juggle, but it’s very unlikely. The primary problem is that the rewards are far removed from the actions that caused them (the classic credit assignment problem). A small mistake in throwing a ball won’t result in a punishment until the ball actually drops, which could easily be 50 to 100 decisions later. As much as I’d like to solve the problem that way (by giving the agent as little information as possible), it’s just not feasible, so we’re going to have to make some compromises.
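To make the sparse-reward problem concrete, here’s a rough Python sketch of the naive scheme described above (the actual project is Unity/C#; the function name and reward magnitudes here are placeholders I made up for illustration):

```python
# Naive reward scheme: dense survival reward, big penalty only on a drop.
# The penalty arrives many decisions after the throw that caused it, so
# the agent has almost no signal about *which* action was the bad one.
def naive_reward(ball_dropped: bool) -> float:
    if ball_dropped:
        return -10.0   # large punishment, possibly 50-100 decisions too late
    return 0.01        # small reward for every frame the balls stay up
```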
One thing I decided to do to make this easier to train is to separate throwing and catching into two different agents. Ideally I’d have one juggling agent with one neural network that knows how to juggle, but throwing and catching are very different behaviors, and we can get more direct results by separating them. So there’s a “throwing brain” and a “catching brain”, and only one is active at a time, depending on whether the agent is holding a ball or not.
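The switching logic is about as simple as it sounds. Sketched in Python (again, the real thing lives in Unity/C#, and these names are made up for illustration):

```python
def select_brain(holding_ball: bool) -> str:
    """Only one brain is active at a time: the throwing brain while
    the agent holds a ball, the catching brain otherwise."""
    return "throwing" if holding_ball else "catching"
```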
I’ve also had to play around with the rewards to stop certain behaviors that are clearly wrong. For example, I kept seeing the hands stray very far from their starting positions. We want an agent that juggles in place, not one that wanders around, so I give the agent a very small reward each frame based on how close each hand is to its original starting position. I also started giving the agent a reward based on how close a throw lands to its target. My initial setup was a simple “reward if it’s a hit, punishment otherwise”, but that kind of binary signal isn’t going to produce a good gradient that we can learn from.
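Here’s roughly what those two shaped rewards look like in Python (the scale and falloff constants are placeholders, not my actual tuned values):

```python
import math

def stay_home_reward(hand_pos, home_pos, scale=0.001):
    # Tiny per-frame reward that shrinks as the hand drifts away from
    # its starting position, discouraging the agent from wandering.
    return scale / (1.0 + math.dist(hand_pos, home_pos))

def throw_accuracy_reward(landing_pos, target_pos, falloff=1.0):
    # Continuous accuracy score instead of a binary hit/miss: a near
    # miss scores almost as well as a hit, so there's a gradient to
    # follow toward the target.
    return math.exp(-falloff * math.dist(landing_pos, target_pos))
```

The key design point is that both rewards vary smoothly with distance, so slightly better behavior always scores slightly better.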
A problem that I’m going to have to deal with eventually is timing. If one of the balls gets thrown a little fast or slow, we end up losing our regular rhythm. In the worst case, two of the balls end up having to be thrown too quickly one after the other, or even collide. To solve this, I’m going to feed a timing pulse into the reinforcement learning system, ranging from -1 to 1. Let’s say, for example, that I want one ball thrown every second, and to make the math easy, these throws should occur at whole numbers: the first throw is 1 second in, the second throw is 2 seconds in, and so on. At 1 second, my pulse sends a 1 to the system. At 1.5 seconds, it sends a -1, because that’s the exact wrong time to throw. Then the pulse gradually increases until it hits its maximum again at 2 seconds, and comes back down after. On every throw, I’ll give the agent a reward or punishment based on how close it was to the moment it was supposed to throw. The agent should learn very quickly that throws are only supposed to occur at or near the maximum of the timing pulse.
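With a one-second period, a cosine does exactly what I just described: it peaks at every whole second and bottoms out exactly halfway between. A minimal sketch, assuming the pulse is simply sampled from simulation time:

```python
import math

def timing_pulse(t: float, period: float = 1.0) -> float:
    # Ranges from -1 to 1; hits +1 at every multiple of `period`
    # (the intended throw times) and -1 exactly halfway between them.
    return math.cos(2.0 * math.pi * t / period)

def throw_timing_reward(t_throw: float, period: float = 1.0) -> float:
    # Reward a throw by how close it landed to a pulse maximum:
    # +1 for perfectly on the beat, -1 for maximally off the beat.
    return timing_pulse(t_throw, period)
```

So `timing_pulse(1.0)` is 1.0 and `timing_pulse(1.5)` is -1.0, matching the example above.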
Here’s my current progress on catching. I changed around the physics a bit and haven’t fully re-trained the model yet, so there are more misses than there were earlier (but it will end up being better long term). The reason why there are 16 of these going at once is that it makes training faster. It’s more efficient for Unity to run one physics simulation with 16 experiments going on than to try to run a physics simulation with just 1 experiment, but have it go 16 times faster.
And here’s the (very early) progress on throwing. I’m still playing around with scoring and rewards, so I haven’t bothered to do a very long training session on the throwing so far.
I’m hoping to have a lot more to show on this in a week or two, so fingers crossed!