
Creating a Tendon-Driven Robot That Teaches Itself to Walk with Reinforcement Learning

By Ali Marjaninejad, University of Southern California


Why do industrial robots require teams of engineers and thousands of lines of code to perform even the most basic repetitive tasks, while giraffes, horses, and many other animals can walk within minutes of being born?

My colleagues and I at the USC Brain-Body Dynamics Lab began to address this question by creating a robotic limb that learned to move, with no prior knowledge of its own structure or environment [1,2]. Within minutes, G2P, our reinforcement learning algorithm implemented in MATLAB®, learned how to move the limb to propel a treadmill (Figure 1).

Figure 1. The three-tendon, two-joint robotic limb.

Tendon-Driven Limb Control Challenges

The robotic limb has an architecture analogous to the muscle and tendon structure that drives movement in humans and other vertebrates [1,2]. Tendons connect muscles to bones, enabling the biological motors (the muscles) to exert forces on bones from a distance [3,4]. (The dexterity of the human hand is achieved through a tendon-driven system; there are no muscles in the fingers themselves!)

While tendons have mechanical and structural advantages, a tendon-driven robot is significantly more challenging to control than a traditional robot, where a simple PID controller acting directly on each joint angle is often sufficient. In a tendon-driven robotic limb, multiple motors may act on a single joint, and a single motor may act on multiple joints. As a result, the system is simultaneously nonlinear, over-determined, and under-determined, greatly increasing the control design complexity and calling for a new control design approach.

The G2P Algorithm

The learning process of the G2P (general-to-particular) algorithm has three phases: motor babbling, exploration, and exploitation. Motor babbling is a five-minute period in which the limb performs a series of random movements, similar to the movements a baby vertebrate uses to learn the capabilities of its body.

During the motor babbling phase, the G2P algorithm randomly produces a series of step changes to the current of the limb’s three DC motors (Figure 2), and encoders at each limb joint measure joint angles, angular velocities, and angular accelerations.

Figure 2. Robotic limb and DC motors.
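As a rough illustration of this phase, a babbling loop might look like the following MATLAB sketch. The I/O functions sendMotorCurrents and readJointKinematics are hypothetical placeholders for the data acquisition interface, and the sampling period, hold time, and activation range are assumptions rather than the settings used in the study; only the five-minute duration and the three-motor, two-joint structure come from the article.

    % Motor babbling sketch (hypothetical I/O functions; timing and ranges assumed)
    babbleTime = 5*60;                 % five-minute babbling period (from the article)
    Ts         = 0.01;                 % encoder sampling period in seconds (assumed)
    holdTime   = 1.0;                  % hold each random activation for ~1 s (assumed)
    nSamples   = round(babbleTime/Ts);

    kin = zeros(nSamples, 6);          % [q1 q2 dq1 dq2 ddq1 ddq2] per sample
    act = zeros(nSamples, 3);          % currents sent to the three DC motors

    a = rand(1,3);                     % initial random activation
    for i = 1:nSamples
        if mod(i, round(holdTime/Ts)) == 0
            a = rand(1,3);                          % random step change in motor currents
        end
        sendMotorCurrents(a);                       % placeholder: write to the motor drivers
        [q, dq, ddq] = readJointKinematics();       % placeholder: 1-by-2 angles, velocities, accelerations
        kin(i,:) = [q dq ddq];
        act(i,:) = a;
        pause(Ts);
    end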

The algorithm then generates a multilayer perceptron artificial neural network (ANN) using Deep Learning Toolbox™. Trained with the angular measurements as input data and the motor currents as output data, the ANN serves as an inverse map linking the limb kinematics to the motor currents that produce them (Figure 3).

Figure 3. Artificial neural network (ANN) training on motor babbling data.
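For a feel of what this step involves, a minimal inverse map can be trained with Deep Learning Toolbox's feedforwardnet, as in the sketch below. The hidden-layer sizes and epoch count are illustrative assumptions, not the network used in the study; kin and act are the babbling data collected in the previous sketch.

    % Train a multilayer perceptron as the inverse map: kinematics -> motor currents
    net = feedforwardnet([15 15]);          % two hidden layers; sizes are assumed
    net.trainParam.epochs = 200;            % training length is assumed

    X = kin';                               % feedforwardnet expects samples as columns
    Y = act';
    inverseMap = train(net, X, Y);

    % Given a desired kinematic trajectory, query the map for motor activations
    desiredKin    = kin(1:100, :)';         % placeholder query trajectory
    motorCurrents = inverseMap(desiredKin); % 3-by-100 predicted activations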

Next, the G2P algorithm enters an exploration phase, the first of two phases of reinforcement learning. In this phase, the algorithm directs the robotic limb to repeat a series of cyclic movements and then measures how far the treadmill moved. For each cyclic movement, the algorithm uses a uniform random distribution to generate 10 points, each representing a pair of joint angles. These 10 points are interpolated to create a complete joint-space trajectory for the cycle. The algorithm then calculates the angular velocities and accelerations along this trajectory and uses the inverse map to obtain the associated motor activation values for the complete cycle. It feeds these values to the limb's three motors, repeating the cycle 20 times before checking how far the treadmill moved.

The distance that the limb propels the treadmill is the reward for that attempt: the greater the distance, the higher the reward. If the reward is small or nonexistent, then the algorithm generates a new random cycle and makes another attempt. The algorithm updates the inverse map with the new kinematic information gathered during each attempt. If, however, the reward exceeds a baseline performance threshold (an empirically determined 64 mm), then the algorithm enters its second reinforcement learning phase: exploitation.
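Put together, one exploration attempt might look roughly like the sketch below, reusing the placeholder I/O functions and the trained inverseMap from the earlier sketches. The 10 control points, the interpolation into a joint-space cycle, the 20 repetitions, and the treadmill-distance reward follow the description above; the joint-angle limits, cycle duration, and interpolation method are assumptions.

    % One exploration attempt: random cycle in joint space -> motor currents -> reward
    nPoints   = 10;                          % control points per cycle (from the article)
    nCycles   = 20;                          % repetitions before measuring the reward (from the article)
    cycleTime = 1.0;                         % duration of one cycle in seconds (assumed)
    Ts        = 0.01;
    qLimits   = [-pi/4  pi/4;                % joint 1 range in radians (assumed)
                 -pi/3  pi/3];               % joint 2 range in radians (assumed)

    % 10 random joint-angle pairs, interpolated into a smooth closed cycle
    qPts  = qLimits(:,1) + diff(qLimits,1,2).*rand(2, nPoints);
    qPts  = [qPts qPts(:,1)];                            % close the cycle
    tPts  = linspace(0, cycleTime, nPoints+1);
    tFine = 0:Ts:cycleTime;
    q     = interp1(tPts, qPts', tFine, 'spline')';      % interpolation method assumed

    dq  = gradient(q,  Ts);                  % angular velocities
    ddq = gradient(dq, Ts);                  % angular accelerations

    aCycle = inverseMap([q; dq; ddq]);       % motor activations for the whole cycle

    for c = 1:nCycles
        for i = 1:numel(tFine)
            sendMotorCurrents(aCycle(:,i)'); % placeholder: drive the three motors
            pause(Ts);
        end
    end
    reward = readTreadmillDistance();        % placeholder: distance the treadmill moved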

In this phase, having identified a series of movements that works reasonably well, the algorithm begins looking for a better solution in the vicinity of the trajectory it previously tested. It does this by using a Gaussian distribution to generate random values near those used in the previous attempt. If the reward for the new set of values is higher than for the previous set, the algorithm keeps going, recentering the Gaussian distribution on the new best set of values. When an attempt produces a reward lower than the current best, those values are rejected in favor of the "best-so-far" values (Figure 4).
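In code, the exploitation phase amounts to a simple hill climb with Gaussian perturbations around the best set of control points found so far. The sketch below reuses qPts and reward from the exploration sketch; evaluateCycle is a hypothetical helper that wraps the steps shown there (interpolate the points, query the inverse map, run 20 cycles, and return the treadmill distance), and the perturbation width and attempt budget are assumptions.

    % Exploitation sketch: Gaussian search around the best-so-far control points
    sigma     = 0.05;                 % std. dev. of the perturbation in radians (assumed)
    nAttempts = 50;                   % exploitation budget (assumed)

    bestPts    = qPts;                % best cycle found during exploration
    bestReward = reward;

    for k = 1:nAttempts
        candidate = bestPts + sigma*randn(size(bestPts));  % sample near the best cycle
        r = evaluateCycle(candidate);                       % hypothetical helper (see above)
        if r > bestReward
            bestReward = r;           % recenter the search on the improved cycle
            bestPts    = candidate;
        end                           % otherwise keep the best-so-far values
    end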

Figure 4. The G2P algorithm in the exploration phase.

The Emergence of Unique Gaits

Each time the G2P algorithm runs, it starts learning from scratch, exploring the dynamics of the robotic limb with a new set of random movements. When, by chance, the motor babbling or exploration phase happens to be particularly effective, the algorithm learns faster and needs fewer attempts to reach the exploitation phase (Figure 5). The algorithm does not seek the best possible movement for propelling the treadmill, only one that is good enough. Humans and other organisms also learn "good enough" ways of using their bodies, because every practice attempt carries costs, including the risk of injury, fatigue, and the expenditure of time and energy that could be applied to learning other skills.

Figure 5. Treadmill reward plotted against attempts made for each of 15 different runs of the G2P algorithm.

One remarkable consequence of starting with random movements and searching for a "good enough" solution is that the algorithm produces a different gait every time it is run. We have seen the G2P algorithm produce a wide variety of gait patterns, from heavy stomping to dainty tip-toeing. We call these unique gaits the robot's "motor personalities." We believe such approaches will enable robots to have more anthropomorphic features and traits in the future.

Adding Feedback and Future Enhancements

The initial implementation of G2P was entirely feed-forward. As a result, it had no way of responding to perturbations, such as a collision, other than through the system's passive response. To address this issue, we implemented a version of G2P that incorporates minimal feedback [5]. Even in the presence of reasonably long sensory delays (100 ms), we found that the addition of simple feedback enabled this new G2P algorithm to compensate for errors arising from impacts or from imperfections in the inverse map. We also found that feedback accelerates learning, allowing shorter motor babbling sessions and fewer exploration/exploitation attempts.
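As an illustration of the idea, and not the exact controller from [5], simple kinematic feedback can be added by correcting the feedforward activation with a term proportional to the delayed joint-angle error. The sketch below reuses q, aCycle, tFine, and Ts from the exploration sketch; the gain, the mapping from joint errors to the three motors, and the delay handling are all assumptions.

    % Feedforward command plus simple kinematic feedback (illustrative only)
    delaySteps = round(0.100/Ts);            % ~100 ms sensory delay
    Kp = 0.5;                                % proportional gain (assumed)
    R  = [ 1  0;                             % assumed mapping from the two joint errors
          -1  1;                             % to the three motor currents; the sign
           0 -1];                            % pattern here is purely illustrative

    for i = 1:numel(tFine)
        j = max(1, i - delaySteps);          % the measurement reflects the limb ~100 ms ago
        [qMeas, ~, ~] = readJointKinematics();        % placeholder encoder read
        e = q(:, j) - qMeas(:);              % error between desired and measured angles
        sendMotorCurrents((aCycle(:, i) + Kp*(R*e))');  % corrected motor command
        pause(Ts);
    end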

We plan to extend the principles embodied in the G2P algorithm to the development of biped and quadruped robots, as well as robotic manipulation.

Why MATLAB?

Our team decided to use MATLAB over other available software packages for several reasons. First, our research is multidisciplinary, involving neuroscientists and computer scientists as well as biomedical, mechanical, and electrical engineers. Every member of the team knows MATLAB, regardless of their discipline, making it a common language and an effective means of collaboration.

Another reason for choosing MATLAB is that it makes the work easier for other researchers to replicate and extend. The code we wrote can be run on any version of MATLAB. If we apply zero-phase digital filtering using filtfilt() in MATLAB, for example, we can be confident that others will be able to use that same function and get the same results. In Python or C, by contrast, there would be package or library versions to worry about, as well as dependencies requiring updates or even downgrades to other packages already in use. In my experience, MATLAB has no such limitations.
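As a concrete example of the kind of call mentioned above, zero-phase low-pass filtering of a recorded joint-angle signal might look like this (butter and filtfilt are in Signal Processing Toolbox; the sampling rate, cutoff, and filter order are arbitrary illustrative values):

    % Zero-phase low-pass filtering of a joint-angle signal (illustrative parameters)
    Fs = 100;                          % sampling rate in Hz (assumed)
    Fc = 5;                            % cutoff frequency in Hz (assumed)
    [b, a] = butter(4, Fc/(Fs/2));     % 4th-order Butterworth low-pass filter
    qFiltered = filtfilt(b, a, qRaw);  % qRaw: a recorded joint-angle vector; no phase lag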

Finally, I want to mention the excellent customer support that comes with MATLAB. The customer support team helped us resolve several issues we encountered with data acquisition. Their response time and their level of expertise on the topic were impressive.

I gratefully acknowledge my colleagues Darío Urbina-Meléndez and Brian Cohn, as well as Dr. Francisco Valero-Cuevas, director of the Brain-Body Dynamics Lab (ValeroLab.org) and PI, who collaborated with me on the projects described in this article. I also want to thank our sponsors, including DoD, DARPA, NIH, and the USC Graduate School, for their support of this project.

About the Author

Ali Marjaninejad is a biomedical engineering doctoral candidate at USC. His research interests include artificial intelligence, bio-inspired systems, biomedical signal processing, neuroscience, and brain machine interfaces.

Published 2020

References

    [1] Marjaninejad, Ali, et al. "Autonomous functional movements in a tendon-driven limb via limited experience." arXiv preprint arXiv:1810.08615 (2018).

    [2] Marjaninejad, Ali, et al. "Autonomous functional movements in a tendon-driven limb via limited experience." Nature Machine Intelligence 1.3 (2019): 144.

    [3] Valero-Cuevas, Francisco J. Fundamentals of Neuromechanics. Springer Series on Biosystems and Biorobotics, Springer-Verlag, London, 2016.

    [4] Marjaninejad, Ali, and Francisco J. Valero-Cuevas. "Should Anthropomorphic Systems Be 'Redundant'?" Biomechanics of Anthropomorphic Systems. Springer, Cham, 2019: 7-34.

    [5] Marjaninejad, Ali, Darío Urbina-Meléndez, and Francisco J. Valero-Cuevas. "Simple Kinematic Feedback Enhances Autonomous Learning in Bio-Inspired Tendon-Driven Systems." arXiv preprint arXiv:1907.04539 (2019).