Imagine a world where a robotic arm can master the intricacies of a backhand smash or outmaneuver you with a perfectly timed spin serve. This isn't science fiction; it's the cutting-edge reality of artificial intelligence. In our quest to push the boundaries of machine learning, we dive into the fascinating journey of "Play Table Tennis with AI".
Spoiler: I used reinforcement learning, so during the first training phase I still played better than the robot (that didn't last long).
Choosing the Right Model
Tackling the challenge of playing table tennis with AI requires a method that can handle the complexity and dynamism of the game. So what strategy did I use?
Reinforcement learning (RL) is the optimal solution for this task. RL excels in environments where the AI must learn from interaction and adapt its strategies over time. By receiving feedback in the form of rewards and penalties, the AI model iteratively improves its performance, much like a human learning from experience. This ability to learn from trial and error, and to continuously refine its actions based on outcomes, makes reinforcement learning particularly well suited to mastering the nuanced and fast-paced nature of table tennis. Consequently, with this architecture our actors are:
- Environment: the table tennis table.
- Agent: the expensive KUKA arm.
- Reward: a function we will define ourselves (a minimal sketch of this interaction loop follows the list).
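To make the three roles concrete, here is a simplified sketch of the agent-environment-reward loop. The `env` and `agent` objects and their method names are hypothetical placeholders, not the project's actual classes.

```python
# Minimal sketch of the RL interaction loop (hypothetical names, not the project's real code).

def run_episode(env, agent):
    """env: the table tennis simulation; agent: the policy controlling the KUKA arm."""
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                      # the agent picks joint commands
        next_state, reward, done = env.step(action)    # the reward comes from our reward function
        agent.learn(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
    return total_reward
```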
Data Collection and Preparation of the Model
In this article, we will explore the process of modelling and implementing a robotic arm in the virtual environment of PyBullet. PyBullet is an open-source library that offers precise real-time physics simulation for robotics, making it a perfect option for creating realistic and dynamic environments for AI training.
Environment
Understanding the environment was the first thing to deal with. As said before, I trained the model in PyBullet using state variables such as the joint positions of the arm (in radians), the current ball position and velocity, and several others, listed below (a small sketch of how this state vector can be unpacked follows the list):
- 0-10: Joint positions of the arm (radians).
- 11-13: Pad center position (x, y, z).
- 14-16: Pad normal versor (x, y, z).
- 17-19: Current ball position (x, y, z).
- 20-22: Current ball velocity (x, y, z).
- 23-25: Opponent pad center position (x, y, z).
- 26: Game waiting, cannot move (0 = no, 1 = yes).
- 27: Game waiting for opponent service (0 = no, 1 = yes).
- 28: Game playing, i.e., not waiting (0 = no, 1 = yes).
- 29: Ball in your half-field (0 = no, 1 = yes).
- 30: Ball already touched your court (0 = no, 1 = yes).
- 31: Ball already touched your robot (0 = no, 1 = yes).
- 32: Ball in opponent half-field (0 = no, 1 = yes).
- 33: Ball already touched opponent's court (0 = no, 1 = yes).
- 34-35: Your score, opponent's score.
- 36: Simulation time.
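For readability, here is a hypothetical helper that names the slices of this 37-element state vector. The function name and dictionary keys are just an illustration of the layout above, not the project's actual code.

```python
import numpy as np

def unpack_state(state):
    """Illustrative unpacking of the 37-element state vector described above."""
    state = np.asarray(state)
    return {
        "joint_positions": state[0:11],        # 0-10: joint angles in radians
        "pad_center": state[11:14],            # 11-13: (x, y, z)
        "pad_normal": state[14:17],            # 14-16: (x, y, z)
        "ball_position": state[17:20],         # 17-19: (x, y, z)
        "ball_velocity": state[20:23],         # 20-22: (x, y, z)
        "opponent_pad_center": state[23:26],   # 23-25: (x, y, z)
        "waiting": bool(state[26]),
        "waiting_opponent_service": bool(state[27]),
        "playing": bool(state[28]),
        "ball_in_our_half": bool(state[29]),
        "ball_touched_our_court": bool(state[30]),
        "ball_touched_our_robot": bool(state[31]),
        "ball_in_opponent_half": bool(state[32]),
        "ball_touched_opponent_court": bool(state[33]),
        "our_score": state[34],
        "opponent_score": state[35],
        "simulation_time": state[36],
    }
```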
But how do we communicate with PyBullet? Here is a scheme that describes the fundamental parts of the communication:
As you can see (unless you prefer reading website photos from their metatags), the Agent communicates with the Environment (the Physics Server) via a socket, and both sides stay up to date by exchanging the state variables that describe the agent's joint positions, the ball position, and so on.
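The project's exact socket protocol isn't shown here, but PyBullet itself already supports a client/server split (shared memory, UDP or TCP connection modes). The sketch below is a simplified, single-process version of the agent's read-state/send-action loop; loading the KUKA URDF from `pybullet_data` and the hold-position "action" are assumptions made purely for illustration.

```python
import pybullet as p
import pybullet_data

# In the project the agent and the physics server are separate processes talking over a
# socket; for a self-contained sketch we connect to a local server instead.
client = p.connect(p.DIRECT)   # p.GUI for a window, p.TCP / p.UDP for a remote physics server
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

arm = p.loadURDF("kuka_iiwa/model.urdf", useFixedBase=True)
num_joints = p.getNumJoints(arm)

for step in range(240):
    # Read part of the state: current joint angles (the 0-10 slice of our state vector)
    joint_positions = [p.getJointState(arm, j)[0] for j in range(num_joints)]

    # Send an action: here we simply command every joint to hold its current angle
    for j, q in enumerate(joint_positions):
        p.setJointMotorControl2(arm, j, p.POSITION_CONTROL, targetPosition=q)

    p.stepSimulation()

p.disconnect()
```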
Training the Model
Let's now dive into the funniest (and most difficult) part: training the model to play table tennis with AI.
“How can I explain to him how to play?”
Before the AI model could master the complexities of table tennis, it was essential to teach it the basics. The initial phase of training focused on fundamental skills such as accurately hitting the ball and avoiding simple mistakes like getting stuck under the table. This foundational training involved repetitive tasks and incremental learning, ensuring that the model developed a solid understanding of basic movements and reactions. But that still wasn't enough: there were too many state variables and the robot could not generalize over them well, so the arm was still moving "randomly" (in other words, it was playing like me). For this reason, the best choice was to split the main model into two "sub-models":
- The first model was responsible for moving the "body" of the robot, controlling all the joints except the "wrist" that holds the racket. This network has two hidden layers with 400 and 300 neurons respectively.
- The second model was in charge of moving the "wrist", handling the rotation and inclination of the racket. This network has two hidden layers, each composed of 64 neurons.
Note: this method allowed us to train the model incrementally, using a modular approach to improve specific parts of the arm. We started by training the first model to position the robotic arm accurately, and then progressed to training the second model. A sketch of the two networks is shown below.
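The article doesn't publish the network code, so the following is only a plausible PyTorch sketch of the two actor networks with the layer sizes mentioned above (400/300 for the body, 64/64 for the wrist). The class names, input/output dimensions and activation choices are assumptions, not the project's actual architecture.

```python
import torch.nn as nn

class BodyActor(nn.Module):
    """Controls every joint except the wrist; hidden layers of 400 and 300 neurons."""
    def __init__(self, state_dim=37, action_dim=9, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),   # actions scaled to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class WristActor(nn.Module):
    """Controls the racket's rotation and inclination; two hidden layers of 64 neurons."""
    def __init__(self, state_dim=37, action_dim=2, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)
```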
Reward function
As mentioned previously, it was necessary to teach the robot to play in an "incremental" way, and this process was made possible by the reward function, which allowed us to choose in which cases and contexts to penalize or reward the robot based on the actions taken. Below are the two reward functions used in this first step (which, as you will see, led to a model that plays quite well).
Note: both functions assign a reward in a range between -100 and +100 depending on the case.
First reward function (Body)
```python
def calculate_reward_net1(state):
    # Extract status information
    ball_position = state[17:20]  # position of the ball (x, y, z)

    # Initialize the reward
    reward = 0

    # Case 1: the ball is in the opponent's half of the court
    if TABLE_LENGTH / 2 < ball_position[1] < TABLE_LENGTH:
        # The ball stayed within the width of the table
        if -TABLE_WIDTH / 2 < ball_position[0] < TABLE_WIDTH / 2:
            reward += 100
        else:
            # The ball crossed the width of the table: it is out, but on the opponent's side
            reward += 10
    # Case 2: the ball went past the end of the opponent's half (out long)
    elif ball_position[1] > TABLE_LENGTH:
        # Penalty proportional to the (cubed) speed with which the ball was sent out
        reward += -1 - (abs(state[20]) ** 3 + abs(state[21]) ** 3 + abs(state[22]) ** 3)
    else:
        # The ball remained on the robot's side
        reward += -50

    if state[13] < 0:  # the pad is positioned under the table
        reward += -30

    return reward
```
As you can see, the function evaluates different conditions to assign positive or negative rewards, thereby guiding the agent towards better performance over time. In particular, in Case 2, cubing the velocity components and summing them penalizes high velocities more heavily. The idea is that faster-moving balls, especially those that move erratically or unpredictably, are harder to control and thus warrant a greater penalty; the cube emphasizes larger values (a component of 2 adds 8 to the penalty, while 4 adds 64), making the penalty particularly severe for higher velocities.
Second reward function (Wrist)
```python
def calculate_reward_net2(state):
    reward = 0

    # Extract status information
    ball_position = state[17:20]  # position of the ball (x, y, z)

    # Case 1: our robot responds by making the ball touch the opponent's court
    # (the opponent is unable to catch it)
    if state[33]:
        reward += 100
    # Case 2: our robot responds correctly by sending the ball into the opponent's half,
    # but it DOES NOT touch the court because the opponent robot "catches it on the fly"
    elif TABLE_LENGTH / 2 < ball_position[1] < TABLE_LENGTH:
        reward += 90
    # Case 3: the ball goes out past the opponent's side
    elif (ball_position[1] > TABLE_LENGTH) or (ball_position[2] > 2):
        # Penalty proportional to the (cubed) speed with which the ball is sent out
        reward += -1 - (abs(state[20]) ** 3 + abs(state[21]) ** 3 + abs(state[22]) ** 3)
    # Case 4: the match is over (done = True) and our court has been touched (the robot
    # moved the joints badly and was unable to send the ball to the opponent's court)
    elif state[30] and state[31]:
        # Penalty proportional to the distance of the ball from the net
        reward += -1 - (TABLE_LENGTH / 2 - state[18]) * 250
    else:
        reward += -100

    return reward
```
The second reward function is quite similar to the first one (for example, it penalizes the ball velocity as well), but it also makes a distinction between a successful hit that is caught by the opponent and one that is not. It includes a condition for the end of the match with a proportional penalty based on the ball’s distance from the net. As mentioned previously, this function will assist the second model in understanding the precise amount of force required for the racket to strike the ball.
Action
Ok, everything is fine, but how many actions should the model take and when?
We want to reward specific actions, such as those taken when the robot is actively engaged and not waiting for the opponent's ball. In our model, the robot only takes an action (one for each network) when the ball is near the arm (for the first model) and when the distance between the ball and the racket is below a certain threshold (for the second model). This choice made it more explicit to the robot, within each episode (an episode corresponds to a match), which actions it had to carry out to play well. A sketch of this gating logic is shown below.
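The exact distance thresholds aren't given in the article, so the following gating sketch uses made-up values (`BODY_TRIGGER_DIST`, `WRIST_TRIGGER_DIST`) and hypothetical actor callables purely to illustrate the idea of acting only when the ball is close enough.

```python
import numpy as np

# Hypothetical thresholds, not the values used in the project
BODY_TRIGGER_DIST = 1.5   # ball must be this close to the arm for the body network to act
WRIST_TRIGGER_DIST = 0.3  # ball must be this close to the racket for the wrist network to act

def select_actions(state, body_actor, wrist_actor, arm_base_position):
    ball_position = np.asarray(state[17:20])
    pad_center = np.asarray(state[11:14])
    waiting = bool(state[26]) or bool(state[27])  # game waiting / waiting for opponent service

    body_action, wrist_action = None, None
    if not waiting:
        if np.linalg.norm(ball_position - arm_base_position) < BODY_TRIGGER_DIST:
            body_action = body_actor(state)    # move the arm towards the ball
        if np.linalg.norm(ball_position - pad_center) < WRIST_TRIGGER_DIST:
            wrist_action = wrist_actor(state)  # orient the racket for the hit
    return body_action, wrist_action
```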
Other improvements
We will now mention some improvements that led to faster convergence of the model and saved us considerable training time.
Trajectory prediction: predicting the ball's trajectory enabled the robot to anticipate the ball's path and determine the best position to maximize rewards (simple projectile motion was used). A sketch of this prediction is shown below.
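The article only says that projectile motion was used, so here is a minimal sketch of how one could predict, under constant gravity and ignoring drag, spin and bounces, where the ball will be when it reaches a given y coordinate (for example, the robot's hitting plane). The function and its arguments are illustrative, not the project's code.

```python
GRAVITY = 9.81  # m/s^2, acting downward along z

def predict_ball_at_y(ball_position, ball_velocity, target_y):
    """Predict (x, y, z) of the ball when it crosses the plane y = target_y,
    assuming ideal projectile motion (no drag, no spin, no bounce)."""
    x0, y0, z0 = ball_position
    vx, vy, vz = ball_velocity
    if vy == 0:
        return None  # the ball never reaches the plane
    t = (target_y - y0) / vy
    if t < 0:
        return None  # the plane is behind the ball's direction of travel
    x = x0 + vx * t
    z = z0 + vz * t - 0.5 * GRAVITY * t ** 2
    return (x, target_y, z)
```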
Prioritized replay buffer: this structure is used in reinforcement learning to store and sample experiences more effectively during training. Unlike a regular replay buffer, which samples experiences uniformly at random, a prioritized replay buffer assigns a priority to each experience based on its importance (typically the magnitude of its TD error), so that more informative transitions are replayed more often. A minimal sketch is shown below.
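The project's buffer implementation isn't shown, so below is a minimal proportional-prioritization sketch (a plain list instead of the sum-tree used in the original paper, and without importance-sampling weights) just to convey the sampling idea.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional prioritized replay buffer (illustrative, not optimized)."""

    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha   # how strongly priorities affect sampling (0 = uniform)
        self.eps = eps       # keeps every priority strictly positive
        self.buffer = []
        self.priorities = []

    def add(self, transition, td_error=1.0):
        priority = (abs(td_error) + self.eps) ** self.alpha
        if len(self.buffer) >= self.capacity:      # drop the oldest experience when full
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()                # sampling probability proportional to priority
        indices = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        return [self.buffer[i] for i in indices], indices

    def update_priorities(self, indices, td_errors):
        for i, err in zip(indices, td_errors):     # refresh priorities after a learning step
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```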
The result
“Play Table Tennis with AI”, but what’s the final result?
In the following video, we show a match between our model and Auto Player, a deterministic opponent with no machine learning whatsoever (a little dim, literally without a single neuron in its head) that follows the position of the ball and uses inverse kinematics to hit it.
Training to play table tennis with AI from scratch was not easy, especially when, after several training sessions, the arm kept getting stuck under the table. However, in machine learning, as in life, you cannot learn without making mistakes, and getting stuck before reaching the finish line is part of the process (it's not an invitation to get stuck under tables).