Implicit Behavioral Cloning
Abstract
We find that across a wide range of robot policy learning scenarios, treating supervised policy learning with an implicit model generally performs better, on average, than commonly used explicit models. We present extensive experiments on this finding, and we provide both intuitive insight and theoretical arguments distinguishing the properties of implicit models compared to their explicit counterparts, particularly with respect to approximating complex, potentially discontinuous and multivalued (set-valued) functions. On robotic policy learning tasks we show that implicit behavioral cloning policies with energy-based models (EBMs) often outperform common explicit (Mean Square Error, or Mixture Density) behavioral cloning policies, including on tasks with high-dimensional action spaces and visual image inputs. We find these policies provide competitive results or outperform state-of-the-art offline reinforcement learning methods on the challenging human-expert tasks from the D4RL benchmark suite, despite using no reward information. In the real world, robots with implicit policies can learn complex and remarkably subtle behaviors on contact-rich tasks from human demonstrations, including tasks with high combinatorial complexity and tasks requiring 1mm precision.
Implicit Models, Energy-Based Models, Imitation Learning
1 Introduction
Behavioral cloning (BC) [pomerleau1989alvinn] remains one of the simplest machine learning methods to acquire robotic skills in the real world. BC casts the imitation of expert demonstrations as a supervised learning problem, and despite valid concerns (both empirical and theoretical) about its shortcomings (e.g., compounding errors [ross2011reduction, tu2021closing]), in practice it enables some of the most compelling results of real robots generalizing complex behaviors to new unstructured scenarios [zhang2018deep, florence2019self, zeng2020tossingbot]. Although considerable research has been devoted to developing new imitation learning methods [ho2016generative, abbeel2004apprenticeship, ho2016model] to address BC's known limitations, here we investigate a fundamental design decision that has largely been overlooked: the form of the policy itself. Like many other supervised learning methods, BC policies are often represented by explicit continuous feed-forward models (e.g., deep networks) of the form â = F_θ(o) that map directly from input observations o to output actions â. But what if F_θ is the wrong choice?
In this work, we propose to reformulate BC using implicit models – specifically, the composition of argmin with a continuous energy function E_θ (see Sec. 2 for definition) to represent the policy π_θ:

â = argmin_{a ∈ A} E_θ(o, a)
This formulates imitation as a conditional energy-based modeling (EBM) problem [lecun2006tutorial] (Fig. 1), and at inference time (given o) performs implicit regression by optimizing for the optimal action â via sampling or gradient descent [welling2011bayesian, du2019implicit]. While implicit models have been used as partial components (e.g., value functions) of various reinforcement learning (RL) methods [haarnoja2017reinforcement, du2020planning, kostrikov2021offline, nachum2021provable], our work presents a distinct yet simple method: do BC with implicit models. Further, this enables a unique case study that investigates the choice between implicit vs. explicit policies, which may inform other policy learning settings beyond BC.
Our experiments show that this simple change can lead to remarkable improvements in performance across a wide range of contact-rich tasks: from bimanually scooping piles of small objects into bowls with spatulas, to precisely pushing blocks into fixtures with tight 1mm tolerances, to sorting mixed collections of blocks by their colors. Results show that implicit models for BC exhibit the capacity to learn long-horizon, closed-loop visuomotor tasks better than their explicit counterparts – and surprisingly, give rise to a new class of BC baselines that are competitive with state-of-the-art offline RL algorithms on standard simulated benchmarks [fu2020d4rl]. To shed light on these results, we provide observations on the intuitive properties of implicit models, and present theoretical justification that we believe is highly relevant to part of their success: their ability to represent not only multimodal distributions, but also discontinuous functions.
Paper Organization. After a brief background (Sec. 2), to build intuition on the nature of implicit models, we present their empirical properties (Sec. 3). We then present our main results with policy learning (Sec. 4), both in simulated tasks and in the real world. Inspired by these results, we provide theoretical insight (Sec. 5), followed by related work (Sec. 6) and conclusions (Sec. 7).
2 Background: Implicit Model Training and Inference
We define an implicit model as any composition argmin_y E(x, y), in which inference is performed using some general-purpose function approximator E to solve the optimization problem ŷ = argmin_y E(x, y). We use techniques from the energy-based model (EBM) literature to train such a model. Given a dataset of samples {x_i, y_i}, and regression bounds y_min, y_max, training consists of generating a set of negative counterexamples {ỹ_i^j}_{j=1}^{N_neg} for each sample x_i in a batch, and employing an InfoNCE-style [oord2018representation] loss function. This loss equates to the negative log likelihood of p_θ(y | x), and the counterexamples are used to estimate the normalizing constant:

L_InfoNCE = Σ_i −log p̃_θ(y_i | x_i, {ỹ_i^j}),  where  p̃_θ(y_i | x_i, {ỹ_i^j}) = e^{−E_θ(x_i, y_i)} / (e^{−E_θ(x_i, y_i)} + Σ_j e^{−E_θ(x_i, ỹ_i^j)})
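A minimal numerical sketch of this InfoNCE-style loss for a single (x, y) pair, with a toy analytic energy standing in for a learned E_θ (the helper names and the quadratic energy are our illustrative assumptions):

```python
import numpy as np

def info_nce_loss(energy_fn, x, y_pos, y_negs):
    """InfoNCE-style loss for one (x, y) pair with sampled counterexamples.

    energy_fn(x, y) -> scalar energy; y_negs is an array of negatives.
    Minimizing this loss pushes the positive's energy down relative to
    the counterexamples' energies.
    """
    e_pos = energy_fn(x, y_pos)
    e_negs = np.array([energy_fn(x, y) for y in y_negs])
    logits = -np.concatenate([[e_pos], e_negs])  # softmax over {pos} ∪ negatives
    # numerically stable log-sum-exp for the estimated normalizer
    log_z = logits.max() + np.log(np.sum(np.exp(logits - logits.max())))
    return -(logits[0] - log_z)  # -log p̃(y_pos | x, negatives)

# Toy energy with its minimum at y = 2x, so y = 2 is the "true" action at x = 1.
energy = lambda x, y: (y - 2.0 * x) ** 2
loss_good = info_nce_loss(energy, 1.0, 2.0, np.array([0.0, 1.0, 3.5]))
loss_bad = info_nce_loss(energy, 1.0, 3.5, np.array([0.0, 1.0, 2.0]))
assert loss_good < loss_bad  # the correct action incurs lower loss
```

Note the counterexamples appear only inside the normalizer: the loss never needs the true partition function, only this sampled estimate of it.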
With a trained energy model E_θ, implicit inference can be performed with stochastic optimization to solve ŷ = argmin_y E_θ(x, y). To demonstrate a breadth of approaches, we present results with three different EBM training and inference methods, discussed below; however, a comprehensive comparison of all EBM variants is outside the scope of this paper (see [song2021train] for a comprehensive reference). We use either (a) a derivative-free (sampling-based) optimization procedure, (b) an autoregressive variant of the derivative-free optimizer which performs coordinate descent, or (c) gradient-based Langevin sampling [welling2011bayesian, du2019implicit] with a gradient penalty [gradientpenalty2021] loss during training – see the Appendix for descriptions and comparisons of these choices.
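As one concrete instance of option (a), a simplified derivative-free sampler for a 1-D action can be sketched as follows; the resampling schedule and hyperparameters here are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def dfo_infer(energy_fn, y_min, y_max, n_samples=1024, n_iters=3, rng=None):
    """Derivative-free inference: approximate argmin_y E(y) by sampling.

    Draws uniform samples over [y_min, y_max], then repeatedly resamples
    around the lowest-energy candidates with a shrinking noise scale.
    energy_fn must accept a vector of candidate y values.
    """
    rng = rng or np.random.default_rng(0)
    samples = rng.uniform(y_min, y_max, size=n_samples)
    scale = 0.1 * (y_max - y_min)
    for _ in range(n_iters):
        energies = energy_fn(samples)
        elites = samples[np.argsort(energies)[: n_samples // 10]]  # keep best 10%
        samples = np.clip(
            rng.choice(elites, size=n_samples) + rng.normal(0.0, scale, n_samples),
            y_min, y_max)
        scale *= 0.5
    return samples[np.argmin(energy_fn(samples))]

# Convex toy energy with its minimum at y = 0.3.
y_hat = dfo_infer(lambda y: (y - 0.3) ** 2, -1.0, 1.0)
assert abs(y_hat - 0.3) < 0.05
```

For multi-dimensional actions the same loop applies per sample vector; the autoregressive variant in option (b) instead runs this procedure one action dimension at a time.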
3 Intriguing Properties of Implicit vs. Explicit Models
Consider an explicit model ŷ = F_θ(x), and an implicit model ŷ = argmin_y E_θ(x, y), where both F_θ and E_θ are represented by almost-identical network architectures. Comparing these models, we examine: (i) how do they perform near discontinuities?, (ii) how do they fit multivalued functions?, and (iii) how do they extrapolate? For both F_θ and E_θ we use almost-identical ReLU-activation fully-connected multi-layer perceptrons (MLPs), with the only difference being the additional input y in the latter. Explicit "MSE" models are trained with Mean Square Error (MSE), explicit "MDN" models are Mixture Density Networks (MDN) [bishop1994mixture], and implicit "EBM" models are trained with L_InfoNCE and optimized with derivative-free optimization. Figs. 2 and 3 show models trained on a number of functions (Fig. 2) and multivalued functions (Fig. 3). For each of these we examine regions of discontinuities, multimodalities, and/or extrapolation.
Discontinuities. Implicit models are able to approximate discontinuities sharply without introducing intermediate artifacts (Fig. 2a), whereas explicit models (Fig. 2d), because they fit a continuous function to the data, take on every intermediate value between training samples. As the frequency of discontinuities increases, the implicit model predictions remain sharp at discontinuities, while also respecting local continuities, with piecewise-linear extrapolations up to some decision boundary between training examples (Fig. 2a-c). The explicit model interpolates across each discontinuity (Fig. 2d-f). Once the training data is uncorrelated (i.e., random noise) and without regularization (Fig. 2c, Fig. 2f), implicit models exhibit nearest-neighbors-like behavior, though with nonzero segments around each sample.
Extrapolation. For extrapolation outside the convex hull of the training data (Fig. 2a-f), even with discontinuous or multivalued functions, implicit models often perform piecewise-linear extrapolation of the piecewise-linear portion of the model nearest to the edge of the training-data domain. Recent work [xu2020neural] has shown that explicit models tend to perform linear extrapolation, but that analysis assumes the ground-truth function is continuous.
Multivalued functions. Instead of using argmin to identify a single optimal value, argmin may return a set of values, which may either be interpreted probabilistically, as sampling likely values from the distribution, or in optimization, as the set of minimizers (argmin E_θ(x, ·) is set-valued). Fig. 3 compares a ReLU-MLP trained as a Mixture Density Network (MDN) vs. an EBM across three example multivalued functions.
Visual Generalization. Of particular relevance to learning visuomotor policies, we also find striking differences in extrapolation ability when converting high-dimensional image inputs into continuous outputs. Fig. 4 shows how, on a simple visual coordinate regression task – a notoriously hard problem for convolutional networks [liu2018intriguing] – an MSE-trained Conv-MLP model [levine2016end] with CoordConv [liu2018intriguing] struggles to extrapolate outside the convex hull of the training data. This is consistent with findings in [florence2019self, zeng2020transporter]. A Conv-MLP trained via late fusion (Fig. 4b) as an EBM, on the other hand, extrapolates well with only a few training samples, achieving 1 to 2 orders of magnitude lower test-set error in the low-data regime (Fig. 4d). This is additional evidence distinguishing implicit models from explicit models in a way distinct from multimodality, which is absent in this experiment.
4 Policy Learning Results
| Benchmark | image input | human demos | unknown cardinality | multimodal solutions |
|---|---|---|---|---|
| D4RL Human-Experts | ✗ | ✓ | ✗ | ✗ |
| Particle Integrator | ✗ | ✗ | ✗ | ✗ |
| Block Pushing | ✓ | ✗ | ✗ | ✓ |
| Planar Sweeping | ✓ | ✓ | ✓ | ✓ |
| Bi-Manual Sweeping | ✓ | ✗ | ✓ | ✓ |
| Real Robot | ✓ | ✓ | ✗ | ✓ |
We evaluate implicit models for learning BC policies across a variety of robotic task domains (Fig. 5). The goals of our experiments are threefold: (i) to compare the performance of otherwise-identical policies when represented as either implicit or explicit models, (ii) to test how well our models (both implicit and explicit) compare with author-reported baselines on a standard set of tasks, and (iii) to demonstrate that implicit models can be used to learn effective policies from human demonstrations with visual observations on a real robot. The following results and discussions are organized by task domain – each evaluating a unique set of desired properties for policy learning (Table 1). All tasks are characterized by discontinuities and require generalization (e.g., extrapolation) to some degree.
D4RL [fu2020d4rl] is a recent benchmark for offline reinforcement learning. We evaluate our implicit (EBM) and explicit (MSE) policies across the subset of tasks for which offline datasets of human demonstrations are provided, which is arguably the hardest set of tasks. Surprisingly, we find that our implementations of both implicit and explicit policies significantly outperform the BC baselines reported on the benchmark, and provide competitive results with the state-of-the-art offline reinforcement learning results reported thus far, including CQL [kumar2020conservative] and S4RL [sinha2021s4rl]. Adding perhaps the simplest possible use of reward information – restricting sampling to only the top 50% of demonstrations sorted by their returns (similar to Reward-Weighted Regression (RWR) [peters2007reinforcement]) – intriguingly improves implicit policies in general, in some cases to new state-of-the-art performance, while helping explicit models less. This suggests that implicit BC policies benefit more from data quality than explicit BC policies do. A simple Nearest-Neighbor baseline (see Appendix) performs better than one might expect on these tasks, but on average not as well as implicit BC.
| Domain | Task Name | Nearest Neighbor (baseline, explicit) | BC (baseline, explicit, from CQL [kumar2020conservative]) | CQL [kumar2020conservative] (baseline, implicit) | S4RL [sinha2021s4rl] (baseline, implicit) | BC (MSE) (ours, explicit) | BC (EBM) (ours, implicit) | BC (MSE) w/ RWR [peters2007reinforcement] (ours) | BC (EBM) w/ RWR [peters2007reinforcement] (ours) |
|---|---|---|---|---|---|---|---|---|---|
| Franka | kitchen-complete | 1.92 ± 0.00 | 1.4 | 1.8 | 3.08 | 1.76 ± 0.04 | 3.37 ± 0.19 | 1.22 ± 0.18 | 3.37 ± 0.01 |
| Franka | kitchen-partial | 1.70 ± 0.00 | 1.4 | 1.9 | 2.99 | 1.69 ± 0.02 | 1.45 ± 0.35 | 1.86 ± 0.26 | 2.18 ± 0.05 |
| Franka | kitchen-mixed | 1.46 ± 0.00 | 1.9 | 2.0 | – | 2.15 ± 0.06 | 1.51 ± 0.39 | 2.03 ± 0.06 | 2.25 ± 0.14 |
| Adroit | pen-human | 1908.0 ± 0.0 | 1121.9 | 1214.0 | 1419.6 | 2141 ± 109 | 2586 ± 65 | 2108 ± 58.8 | 2446 ± 207 |
| Adroit | hammer-human | 85.2 ± 0.0 | 82.4 | 300.2 | 496.2 | 38 ± 25 | 133 ± 26 | 35.1 ± 45.1 | 9.3 ± 45.5 |
| Adroit | door-human | 91.8 ± 0.0 | 41.7 | 234.3 | 736.5 | 79 ± 15 | 361 ± 67 | 17.9 ± 13.8 | 399 ± 34 |
| Adroit | relocate-human | 3.8 ± 0.0 | 5.6 | 2.0 | 2.1 | 3.5 ± 1.1 | 0.1 ± 2.4 | 3.7 ± 0.3 | – |

Of these methods, only CQL, S4RL, and the w/ RWR variants use reward data.
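The simple return-based data selection described above (keep only the top 50% of demonstrations sorted by return, in the spirit of RWR) can be sketched as follows; the (episode, return) data layout is an illustrative assumption, not the benchmark code:

```python
def top_half_by_return(episodes):
    """Keep the top 50% of demonstration episodes ranked by return.

    `episodes` is a list of (episode_data, episode_return) pairs.
    """
    ranked = sorted(episodes, key=lambda ep: ep[1], reverse=True)
    return ranked[: max(1, len(ranked) // 2)]

demos = [("ep_a", 10.0), ("ep_b", 3.0), ("ep_c", 7.5), ("ep_d", 1.0)]
kept = top_half_by_return(demos)
assert [name for name, _ in kept] == ["ep_a", "ep_c"]
```

Training then samples batches only from the kept episodes; no other part of the BC pipeline changes.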
While many of the D4RL tasks have complex, high-dimensional action spaces (up to 30-D), they do not emphasize the full spectrum of task attributes (Table 1) we are interested in. The following tasks isolate other attributes or introduce new ones, such as highly stochastic dynamics (i.e., single-point-of-contact block pushing), complex multi-object interactions (many small particles), and combinatorial complexity.
N-D Particle Integrator is a simple environment with linear dynamics, but where a discontinuous oracle policy is used to generate training demonstrations: once within the vicinity of the goal-conditioned location (Fig. 5), the policy must switch to the second goal. The benefit of studying this environment is twofold: (i) it has none of the complicating attributes in Table 1 and so allows us to study discontinuity in isolation, and (ii) we can define this simple environment in N dimensions. Varying N from 1 to 32 dimensions, but holding the number of demonstrations constant, we find we are able to train 95%-successful implicit policies up to 16 dimensions, whereas explicit (MSE) policies achieve the same success rate only up to 8 dimensions. The Nearest-Neighbor baseline, meanwhile, cannot generalize, and only performs well on the 1-D task (see Appendix for more analysis).
| Method | Single Target, states | Multi Target, states | Single Target, pixels |
|---|---|---|---|
| EBM | 100 ± 0 | 99.0 ± 0.0 | 100 ± 0 |
| MDN | 100 ± 0 | 99.7 ± 0.5 | 10.0 ± 4.3 |
| MSE | 98.3 ± 0.5 | 89.7 ± 4.8 | 87.0 ± 4.1 |
| Nearest-Neighbor | 4.0 ± 0.0 | 0.0 ± 0.0 | 4.3 ± 1.9 |
Simulated Pushing consists of a simulated 6-DoF xArm6 robot in PyBullet [coumans2016pybullet] equipped with a small cylindrical end effector. The task is to push a block into a target goal zone, marked by a green square on the tabletop. We investigate two variants: (a) pushing a single block to a single target zone, or (b) additionally pushing the block to a second goal zone (multi-stage). We evaluate implicit (EBM) and explicit (MSE and MDN [rahmatizadeh2018vision, ha2018world]) policies on both variants, trained from a dataset of 2,000 demonstrations generated by a scripted policy that readjusts its pushing direction if the block slips from the end effector. Results in Table 3 show that all learning methods perform well on the single-target task, while MSE struggles with the slightly longer task horizon. On the image-based task, the MDN struggles significantly compared to MSE and EBM. The failures of the Nearest-Neighbor baseline, with only 0-4% success rates, show that generalization is required for this task.
Planar Sweeping [suh2020surprising] is a 2D environment in which an agent (in the form of a blue stick) must push a pile of 50-100 randomly positioned particles into a green goal zone. The agent has 3 degrees of freedom (2 for position, 1 for orientation). We train implicit (EBM) and explicit (MSE) policies from 50 teleoperated human demonstrations, and test on episodes with unseen particle configurations. For the image-based inputs, we also test two types of encoders with different forms of dimensionality reduction: spatial soft(arg)max and average pooling over dense features (see Appendix for architecture descriptions). For the state-based inputs, since the number of particles varies between episodes, we flatten the poses of the particles and zero-pad the vector to the size corresponding to the maximum particle cardinality.
|  |  | # ResNet layers |  |  |
| Method | Input & Encoder | 8 | 14 | 20 |
|---|---|---|---|---|
| EBM | image + softmax | 78.7 ± 4.9 | 82.1 ± 0.9 | 82.6 ± 3.1 |
| EBM | image + pool | 78.0 ± 2.2 | 76.5 ± 1.0 | 74.2 ± 1.9 |
| EBM | state | 28.7 ± 0.8 | 29.2 ± 0.5 | 28.9 ± 0.2 |
| MSE | image + softmax | 62.9 ± 5.0 | 51.4 ± 8.9 | 56.6 ± 5.2 |
| MSE | image + pool | 75.6 ± 1.3 | 73.9 ± 1.7 | 74.8 ± 1.2 |
| MSE | state | 28.9 ± 0.2 | 28.2 ± 0.4 | 27.8 ± 0.3 |
The results in Table 4 (averaged over 3 training runs with different seeds) suggest that image-based EBMs outperform the best MSE architectures by 7%. Interestingly, image-based EBMs seem to synergize well with spatial soft(arg)max for dimensionality reduction, as opposed to pooling, which works best for MSE explicit policies. In both cases, state observations as inputs do not perform well compared with image-pixel inputs. This is likely because the particles have symmetries in image space, but not when observed as a vector of poses.
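The zero-padded state encoding used for the state-based inputs above can be sketched as follows; the function name and the per-particle pose dimensionality are illustrative assumptions:

```python
import numpy as np

def pad_particle_state(poses, max_particles, pose_dim=3):
    """Flatten variable-cardinality particle poses and zero-pad to a fixed size.

    `poses` is an (n, pose_dim) array-like with n <= max_particles; the output
    always has length max_particles * pose_dim, so a fixed-size MLP can consume
    episodes with differing particle counts.
    """
    flat = np.asarray(poses, dtype=np.float32).reshape(-1)
    out = np.zeros(max_particles * pose_dim, dtype=np.float32)
    out[: flat.size] = flat
    return out

# Two particles, padded out to a 100-particle capacity vector.
state = pad_particle_state([[0.1, 0.2, 0.0], [0.3, 0.4, 1.5]], max_particles=100)
assert state.shape == (300,)
```

One caveat of this encoding, consistent with the results above: the padded vector imposes an arbitrary particle ordering, discarding the permutation and image-space symmetries that the pixel encoders can exploit.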
Simulated Bi-Manual Sweeping consists of two KUKA IIWA robot arms equipped with spatula-like end effectors. The task is to scoop up randomly configured particles from a workspace and transport them into two bowls, which should be filled equally. Successfully scooping particles and transporting them requires precise coordination between the two arms (e.g., such that the particles do not drop while being transported to the bowls). The action space is 12-DoF (6-DoF Cartesian per arm), and each episode consists of 700 steps recorded at 10 Hz. Perspective RGB images from a simulated camera are used as visual input, along with current end-effector poses as state input. The task is characterized by many mode changes and discontinuities (transitioning from scooping to lifting, from lifting to transporting, and deciding which bowl to transport to). EBM and MSE policies on this task use the best corresponding image encoder from the planar sweeping task. As shown in Table 5, our results show that EBM outperforms MSE by 14%.
| Method | Input and Encoder | Success % |
|---|---|---|
| EBM | image + softmax | 78.2 ± 2.7 |
| MSE | image + pool | 63.9 ± 7.7 |
Real Robot Manipulation. Using a cylindrical end-effector on an xArm6 robot (Fig. 9a), we evaluate implicit BC and explicit BC policies on 4 real-world manipulation pushing tasks: 1) pushing a red block and a green block into assigned target fixtures, 2) pushing the red and green blocks into either target fixture, in either order, 3) precise pushing-and-insertion of a blue block into a tight (1mm tolerance) target fixture, and 4) sorting of 4 blue blocks and 4 yellow blocks into different targets. The observation input is only raw perspective RGB images at 5 Hz, with task horizons up to 60 seconds, and training only from teleoperated demonstrations.
| Task | Push-Red-then-Green | Push-Red/Green-Multimodal | Insert-Blue | Sort-Blue-from-Yellow |
|---|---|---|---|---|
| # demos | 95 | 410 | 223 | 502 |
| Avg. length ± std. (seconds) | 19.1 ± 2.5 | 19.0 ± 3.1 | 22.1 ± 5.5 | 45.2 ± 8.2 |
| [min, max] (seconds) | [14.2, 25.1] | [11.8, 28.1] | [13.0, 43.5] | [25.8, 60.5] |
| Success criterion | 1.0 if both blocks in target | 1.0 if both blocks in target | – | for each correct block in target |
| Success avg. (%): Implicit BC (EBM) | 85.0 ± 5.0 | 88.3 ± 7.6 | 83.3 ± 3.8 | 48.3 ± 4.6 |
| Success avg. (%): Explicit BC (MSE) | 35.0 ± 18.0 | 55.0 ± 18.0 | 6.7 ± 9.4 | 19.6 ± 1.5 |
Across all four tasks, we observe significantly higher performance for the implicit policies compared to the explicit baseline. This is especially apparent on the pushing-and-oriented-insertion task (Insert-Blue), which requires highly discontinuous behavior in order to subtly nudge the block into place – enough, but not too far (Fig. 9c). On this task, the implicit BC policy has an order-of-magnitude higher success rate than the explicit BC policy. The sorting task in particular (Sort-Blue-from-Yellow, Fig. 9d) is our attempt to push the generalization abilities of our models, and we see a 2.4× higher success rate for the implicit policy. Note these experimental results are averaged over 3 different models for each task, for each policy type. The red/green pushing tasks, including the multimodal variant (Fig. 9b), also show notably higher success rates for the implicit policies. These real-world results are best appreciated in our video.
5 Theoretical Insight: Universal Approximation with Implicit Models
In previous sections, we have empirically demonstrated the ability of implicit models to handle discontinuities (Section 3), and we hypothesized this is one of the reasons for the strong performance of implicit BC policies (Section 4). Two theoretical questions we now ask are: (i) is there a provable notion of what class of functions can be represented by implicit models argmin_y g(x, y) given some analytical g, and (ii) given that energy functions learned from data may always be expected to have nonzero error in approximating any function, are there inference risks of large behavior shifts resulting from the combination of argmin and spurious peaks in the learned energy? Recent work [marx2021semi] has shown that a large class of functions (namely, functions defined by finitely many polynomial inequalities) can be approximated implicitly by using SOS polynomials to represent g. Here we show that for implicit models with g represented by any continuous function approximator (such as a deep ReLU-MLP network), argmin_y g(x, y) can represent a larger set of functions, including multivalued functions and discontinuous functions (Thm. 1), to arbitrary accuracy (Thm. 2). These results are stated formally in the following; proofs are in the Appendix.
Theorem 1.
For any set-valued function F: R^m → P(R^n) where the graph of F is closed, there exists a continuous function g: R^m × R^n → R such that F(x) = argmin_y g(x, y) for all x.
Theorem 2.
For any set-valued function F as in Thm. 1, there exists a function g that can be approximated by some continuous function approximator g̃ with arbitrarily small bounded error ε, such that ŷ ∈ argmin_y g̃(x, y) provides the guarantee that the distance from (x, ŷ) to the graph of F is less than some δ(ε), with δ(ε) → 0 as ε → 0.
Of practical note, explicit functions (F in Thms. 1 and 2) with arbitrarily small or large Lipschitz constants can be approximated by an implicit model whose energy function has bounded Lipschitz constant (see Appendix for more discussion). This means that implicit models can approximate steep or discontinuous explicit functions without large gradients in the function approximator that may cause generalization issues. This is not the case for explicit continuous function approximators, which must match the large gradient of the approximated function. In both their multivalued nature and their handling of discontinuities, the approximation capabilities of implicit models are distinctly superior to those of explicit models. See Fig. 10 for visual intuition, and more discussion in the Appendix.
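As a toy illustration of how a continuous energy can implicitly represent a discontinuous function, consider the bilinear energy g(x, y) = −x·y on y ∈ [−1, 1]: its argmin over y is the (discontinuous) sign function, even though g itself is smooth with bounded gradients. A minimal numerical check of this example, using a grid argmin as the optimizer (our own illustrative construction, not the paper's proof):

```python
import numpy as np

# Candidate actions y on a grid over [-1, 1]; the grid includes the
# endpoints exactly, where the bilinear energy attains its minima.
ys = np.linspace(-1.0, 1.0, 201)

def implicit_f(x):
    """argmin_y of the continuous energy g(x, y) = -x * y over y in [-1, 1].

    For x > 0 the energy decreases in y, so the minimizer is y = +1;
    for x < 0 it is y = -1: a step discontinuity from a smooth energy.
    """
    return ys[np.argmin(-x * ys)]

assert implicit_f(0.5) == 1.0
assert implicit_f(-0.5) == -1.0
```

At x = 0 the energy is flat and every y is a minimizer, which also illustrates the set-valued argmin discussed in Sec. 3.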
6 Related Work
Energy-Based Models, Implicit Learning. Reviews of energy-based models can be found in LeCun et al. [lecun2006tutorial] and Song & Kingma [song2021train]. Du & Mordatch [du2019implicit] proposed Langevin MCMC [welling2011bayesian] sampling for training and implicit inference, and argued for several strengths of implicit generation, including compositionality and empirical results such as out-of-distribution generalization and long-horizon sequential prediction. A general framework for energy-based learning of behaviors is also presented in [mordatch2018concept]. In applications, energy-based models have recently shown state-of-the-art results across a number of domains, including various computer vision tasks [gustafsson2020energy, gustafsson2020train], as well as generative modeling tasks such as image and text generation [du2019implicit, du2020improved, deng2020residual]. Many other works have investigated the notion of implicit functions in learning, including works that investigate implicit layers [amos2017optnet, niculae2018sparsemap, wang2019satnet, bai2019deep]. There has also been a surge of interest in implicit representations for geometry learning [park2019deepsdf, mescheder2019occupancy, chen2019learning, saito2019pifu]. In robotics, implicit models have been developed for modeling discontinuous contact dynamics [pfrommer2020contactnets].
Energy-Based Models in Policy Learning. In reinforcement learning, [haarnoja2017reinforcement] uses an EBM formulation as the policy representation. Other recent work uses EBMs in a model-based planning framework [du2020planning], or uses EBMs in imitation learning [liu2020energy], but with an on-policy algorithm. A trend in recent RL works has also been to utilize an EBM as part of an overall algorithm, e.g., [kostrikov2021offline, nachum2021provable].
Policy Learning via Imitation Learning. In addition to behavioral cloning (BC) [pomerleau1989alvinn], the machine learning and robotics communities have explored many additional approaches to imitation learning [osa2018algorithmic, peng2018deepmimic, peng2021amp], often in ways that need additional information. One route is to collect on-policy data from the learned policy, and potentially either label it with rewards to perform on-policy reinforcement learning (RL) [atkeson1997robot, ng2000algorithms, rajeswaran2017learning] or have an expert label actions [ross2011reduction]. Distribution-matching algorithms like GAIL [ho2016generative] require no labeling, but may require millions of on-policy environment interactions. While algorithms like ValueDice [kostrikov2019imitation] implement distribution matching in a sample-efficient off-policy setting, they have not been proven on image observations or high-degree-of-freedom action spaces. Another route to using more information beyond BC is for the off-policy data to be labeled with rewards, which is the focus of the offline RL community [fu2020d4rl]. All of these directions are good ideas. A perhaps not fully appreciated finding, however, is that in some cases even the simplest forms of BC can yield surprisingly good results. On offline RL benchmarks, prior works' implementations of BC already show reasonably competitive results with offline RL algorithms [fu2020d4rl, gulcehre2020rl]. In real-world robotics research, BC has been widely used in policy learning [zhang2018deep, rahmatizadeh2018vision, florence2019self, zeng2020transporter]. Perhaps the success of BC comes from its simplicity: it has the lowest data-collection requirements (no reward labels or on-policy data required), can be data-efficient [florence2019self, zeng2020transporter], and it is arguably the simplest to implement and easiest to tune (with fewer hyperparameters than RL-based methods).
Approximation of Discontinuous Functions. The results of Cybenko [cybenko1989approximation] and others on Universal Approximation with neural networks have had a foundational impact in guiding machine learning research and applications. Various approaches have been developed in the function approximation literature and elsewhere to approximate discontinuous functions [butzer1987approximation, tampos2012accurate, kvernadze2010approximation, stella2016very], typically without neural networks. Also motivated by applications to modeling phenomena for robots, [selmic2002neural] develops a theory of approximating discontinuous functions with neural networks, but the method requires a priori knowledge of the discontinuity's location. Our work builds on the well-known and widely applied results for continuous neural networks, but through composition with argmin provides a notion of universal approximation even for discontinuous, set-valued functions.
7 Conclusion
In this paper we showed that reformulating supervised imitation learning as a conditional energy-based modeling problem, with inference-time implicit regression, often greatly outperforms traditional explicit policy baselines. This includes tasks with high-dimensional action spaces (up to 30-dimensional in the D4RL human-expert tasks), visual observations, and tasks in the real world. In terms of limitations, a primary drawback compared to explicit models is that implicit models typically require more compute, both in training and inference (see Appendix for comparisons). However, we have shown that we can run implicit policies for real-time vision-based control in the real world, and training time is modest compared to offline RL algorithms. To further motivate the use of implicit models, we presented an intuitive analysis of energy-based model characteristics, highlighting a number of potential benefits that, to the best of our knowledge, are not discussed in the literature, including their ability to accurately model discontinuities. Lastly, to ground our results theoretically, we developed a notion of universal approximation for implicit models which is distinct from that of explicit models.
The authors would like to thank Vikas Sindwhani for project direction advice; Steve Xu, Robert Baruch, Arnab Bose for robot software infrastructure; Jake Varley, Alexa Greenberg for ML infrastructure; and Kamyar Ghasemipour, Jon Barron, Eric Jang, Stephen Tu, Sumeet Singh, JeanJacques Slotine, Anirudha Majumdar, Vincent Vanhoucke for helpful feedback and discussions.
References
Appendix for Implicit Behavioral Cloning
Contents
 1 Introduction
 2 Background: Implicit Model Training and Inference
 3 Intriguing Properties of Implicit vs. Explicit Models
 4 Policy Learning Results
 5 Theoretical Insight: Universal Approximation with Implicit Models
 6 Related Work
 7 Conclusion
 A Contributions Statement
 B Energy-Based Model Training and Implicit Inference Details

C Additional Experimental Details and Analysis
 C.1 PerTask Summary of # Demonstrations and Environment Dimensionalities
 C.2 Training and Inference Times, Implicit vs. Explicit Comparison
 C.3 Additional Real-World Experimental Details
 C.4 Nearest-Neighbor Baseline
 C.5 N-D Particle Environment Description
 C.6 Analysis: Training Data Sparsity in the N-D Particle Tasks
 C.7 Additional D4RL tasks
 D Policy Learning Results Overview and Protocol
 E Model Architectures
 F Proofs
 G Theory Implications and Discussion
 H Limitations
Appendix A Contributions Statement
Due to space constraints we did not include a comprehensive contributions statement in the main manuscript, but include one here for clarity:

We present Implicit Behavioral Cloning (Implicit BC), a novel, simple method for imitation learning in which behavioral cloning is cast as a conditional energy-based modeling (EBM) problem, and inference is performed via sampling-based or gradient-based optimization.

We validate Implicit BC in real-world robot experiments, in which we demonstrate physical robots performing several end-to-end, contact-rich pushing tasks (including precision insertion and multi-item sorting), driven with only images as input and only human demonstrations provided as training data. Implicit BC performs significantly better than our explicit BC baseline across all real-world tasks, including an order-of-magnitude increase in performance on the precision insertion task. On the sorting task, the models are shown to be capable of solving an up-to-60-second-horizon, contact-rich, combinatorial task with complex multi-object collisions.

We present extensive simulation experiments comparing Implicit BC both to comparable explicit models from the same codebase and to author-reported quantitative results on the human-expert tasks from the standard D4RL benchmark. We find both our explicit BC and implicit BC models provide competitive or state-of-the-art performance on D4RL tasks with human-provided demonstrations, despite using no reward information. Averaged across all tasks, we find implicit BC outperforms our own best explicit BC models.

We analyze the nature of implicit models in simple 1D→1D examples, and we highlight aspects of implicit models that we believe are not well known in the generative modeling community, including their behavior (i) at discontinuities and (ii) in extrapolation.
Appendix B Energy-Based Model Training and Implicit Inference Details
Our results critically depend on energy-based model (EBM) training, but we do not consider the specific methods we use to be our main contributions (see Sec. A for a list). That said, after considerable experience training conditional EBMs on both simple function-fitting tasks and on policy learning tasks, we believe it is useful to the research community to describe the method specifics in detail. Our goal is to emphasize simplicity when possible, in order to encourage more practitioners to use implicit energy-based regression rather than explicit regression. We first review our approach using derivative-free optimization, then our autoregressive version, and then our approach using Langevin gradient-based sampling. For each, we discuss (i) how to train the models, and (ii) how to perform inference with the models. For a more comprehensive overview of training EBMs, see [song2021train]. Note we will release code as well for training and inference.
For all methods, to compute $y_{min}$ and $y_{max}$ we (1) take the per-dimension min and max over the training data, (2) add a small buffer, typically $0.05\,(y_{max} - y_{min})$, on each side, and then (3) clip these min and max values to the environment's allowed min/max values. For agents that do not use the full range of the environment's allowed values for a given dimension, this enables more precision on that action dimension. All methods use the Adam optimizer with default $\beta_1$, $\beta_2$ values.
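A minimal sketch of this bound computation (the function and variable names are ours, not from the released code):

```python
import numpy as np

def action_bounds(train_actions, env_min, env_max, buffer_frac=0.05):
    """Per-dimension y_min / y_max: take the data min/max, widen by a
    small buffer, then clip to the environment's allowed range."""
    y_min = train_actions.min(axis=0)
    y_max = train_actions.max(axis=0)
    buffer = buffer_frac * (y_max - y_min)
    y_min = np.clip(y_min - buffer, env_min, env_max)
    y_max = np.clip(y_max + buffer, env_min, env_max)
    return y_min, y_max
```

When the demonstrations only cover a narrow slice of an action dimension, the resulting bounds are correspondingly tight, which is what gives the extra precision noted above.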
B.1 Method with Derivative-Free Optimization
For training, this method is very simple. We draw counterexamples $\tilde{y}$ from the uniform random distribution: $\tilde{y} \sim U(y_{min}, y_{max})$. Training consists of drawing batches of data, sampling counterexamples for each sample in each batch, and applying $\mathcal{L}_{InfoNCE}$ (Sec. 2). We typically use a batch size of 512, with 256 counterexamples per sample in the batch. All $x$ and $y$ (i.e., observations and actions) in the training dataset are normalized per-dimension to zero mean and unit variance. We typically use an exponentially decayed learning rate (0.99 decay every 100 steps; initial values are listed in Sec. D). We find that regularizing the models with Dropout does not help performance, perhaps because the stochastic training process (counterexample sampling in each training step) self-regularizes the models.
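To make the training step concrete, here is a minimal NumPy sketch of the loss for a single (x, y) pair; the toy `energy_fn` interface and all names are our own illustration, not the paper's released implementation:

```python
import numpy as np

def info_nce_loss(energy_fn, x, y, y_min, y_max, num_counter=256, rng=None):
    """InfoNCE-style loss for one (x, y) pair: draw counterexamples
    uniformly over the action bounds, then take the negative
    log-probability of the true action under a softmax over
    negated energies (true action should get the lowest energy)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    y_neg = rng.uniform(y_min, y_max, size=(num_counter, len(y)))
    candidates = np.vstack([y[None, :], y_neg])  # true action at index 0
    energies = np.array([energy_fn(x, c) for c in candidates])
    neg_e = -energies
    log_z = np.max(neg_e) + np.log(np.sum(np.exp(neg_e - np.max(neg_e))))
    return -(neg_e[0] - log_z)  # NLL of the true action
```

In practice this is computed batched over the whole minibatch rather than per pair.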
Given a trained energy model $E_\theta$, we use the following derivative-free optimization algorithm to perform inference:
where $\mathrm{Multinomial}(N_{samples}, p)$ refers to sampling $N_{samples}$ times from the multinomial distribution with probabilities $p$, returning the associated elements $\{\tilde{y}_i\}$. For simplicity the noise is written as drawn from $\mathcal{N}(0, \sigma)$, but this should be an $m$-dimensional vector with an independent Gaussian noise sample for each element. This algorithm is very similar to the Cross-Entropy Method [de2005tutorial], but with a few differences: (i) our algorithm does not use a fixed number of elites, (ii) resampling is done with replacement, and (iii) we shrink the sampling variance via a prescribed schedule rather than computing empirical variances. Typical values of $N_{samples}$, the iteration count, and the noise schedule are given in Sec. D, unless otherwise noted.
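The sampler described above can be sketched as follows; this is a hedged illustration with assumed defaults (e.g. `sigma_init`, `sigma_shrink`), not the exact released hyperparameters:

```python
import numpy as np

def dfo_inference(energy_fn, x, y_min, y_max, num_samples=1024, num_iters=3,
                  sigma_init=0.33, sigma_shrink=0.5, seed=0):
    """Derivative-free optimizer sketch: score samples, softmax-resample
    with replacement, perturb with Gaussian noise on a shrinking schedule,
    and finally return the lowest-energy sample."""
    rng = np.random.default_rng(seed)
    y_min, y_max = np.asarray(y_min, float), np.asarray(y_max, float)
    samples = rng.uniform(y_min, y_max, size=(num_samples, len(y_min)))
    sigma = sigma_init
    for it in range(num_iters):
        energies = np.array([energy_fn(x, y) for y in samples])
        if it < num_iters - 1:
            neg_e = -energies - np.max(-energies)          # stable softmax
            probs = np.exp(neg_e) / np.sum(np.exp(neg_e))
            idx = rng.choice(num_samples, size=num_samples, p=probs)  # with replacement
            samples = samples[idx] + sigma * rng.normal(size=samples.shape)
            samples = np.clip(samples, y_min, y_max)
            sigma *= sigma_shrink  # prescribed variance-shrinking schedule
    return samples[int(np.argmin(energies))]
```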
While the above method works well for action spaces of roughly 5 dimensions or fewer (Sec. B.4), we turn to both autoregressive and gradient-based methods for scaling to higher dimensions.
B.2 Method with Autoregressive Derivative-Free Optimization
In the autoregressive version we interleave training and inference with $m$ models, one for each action dimension $j \in \{1, \dots, m\}$. Model $j$ takes in the action dimensions up to and including $j$. This isolates sampling to one degree of freedom at a time, and enables scaling to higher-dimensional action spaces. For more on autoregressive energy models, see [nash2019autoregressive].
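A minimal sketch of autoregressive inference, using a naive uniform search as the per-dimension derivative-free optimizer (the `energy_fns` interface is our own illustration):

```python
import numpy as np

def autoregressive_inference(energy_fns, x, y_min, y_max, num_samples=512, seed=0):
    """One model per action dimension: dimension j is chosen by a
    derivative-free search over scalar candidates, conditioned on the
    already-chosen dimensions 1..j-1."""
    rng = np.random.default_rng(seed)
    y = np.zeros(0)
    for j, energy_j in enumerate(energy_fns):
        cands = rng.uniform(y_min[j], y_max[j], size=num_samples)
        # model j scores (x, y[:j+1]): prefix dims are fixed, dim j varies
        energies = [energy_j(x, np.append(y, c)) for c in cands]
        y = np.append(y, cands[int(np.argmin(energies))])
    return y
```

Because each search is one-dimensional, the sample budget per dimension stays constant as the action dimensionality grows.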
B.3 Method with Gradient-Based Langevin MCMC
For gradient-based MCMC (Markov Chain Monte Carlo) training, we use the approach described in [du2019implicit, mordatch2018concept], which uses stochastic gradient Langevin dynamics (SGLD) [welling2011bayesian]:
Note that in the conditional case, the gradient $\nabla E_\theta$ is taken with respect to only $y$, and not $x$. As in [du2019implicit, mordatch2018concept] we initialize samples from the uniform distribution, similar to Sec. B.1, but then optimize these contrastive samples with MCMC. For each counterexample, we run a fixed number of steps of the MCMC chain (Sec. D). As recommended in [grathwohl2019your], we use a polynomially decaying schedule for the step size $\lambda$. Note that backpropagation is not performed backwards through the chain; rather, a stop_gradient() is used after implicitly generating the samples [du2019implicit]. Also as in [du2019implicit] we clip gradient steps, choosing to clip the full value, i.e., after the gradient and noise have been combined. Additionally, for inference we run the Langevin MCMC chain a second time, giving twice as many inference Langevin steps as were used during training. Also, for Langevin, all $y$ (i.e., actions) in the training dataset are normalized per-dimension to span the range $[-1, 1]$.
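A sketch of this SGLD chain, assuming a caller-supplied `grad_energy` for $\nabla_y E_\theta$ (in a deep-learning framework this comes from automatic differentiation); the schedule constants are illustrative, not the released values:

```python
import numpy as np

def langevin_chain(grad_energy, y_init, num_steps=100, step_init=0.5,
                   step_final=1e-5, noise_scale=0.5, delta_clip=0.5, seed=0):
    """SGLD chain sketch for a conditional EBM: the gradient is taken
    w.r.t. the action y only; the step size polynomially decays; the
    combined gradient + noise update is clipped as a whole."""
    rng = np.random.default_rng(seed)
    y = np.array(y_init, dtype=float)
    for k in range(num_steps):
        frac = k / max(num_steps - 1, 1)
        step = (step_init - step_final) * (1.0 - frac) ** 2 + step_final  # power-2 decay
        noise = noise_scale * rng.normal(size=y.shape)
        delta = -0.5 * step * grad_energy(y) + np.sqrt(step) * noise
        y = y + np.clip(delta, -delta_clip, delta_clip)  # clip the full update
    return y
```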
B.3.1 Gradient Penalty
For additional stability of training, we use both spectral normalization [miyato2018spectral] as in [du2019implicit], and also add gradient penalties. Gradient penalties are well known in the GAN community, and the form of our gradient penalty is inspired by [gradientpenalty2021]:
$\mathcal{L}_{grad} = \sum_{i} \sum_{j} \max\Big(0, \; \big\|\nabla_y E_\theta(x_i, \tilde{y}_i^{j})\big\| - M\Big)^2$
where the sums over $i$ and $j$ represent, respectively, the sum over training samples and counterexamples per data sample, computed at some subset of the iterative chain samples, for which we find it sufficient to use only the final step. The margin $M$ controls the scale of the gradient relative to the noise in SGLD. If $M$ is too large, then the noise in SGLD has little effect; if $M$ is too small, then the noise overpowers the gradient. Empirically we find $M = 1$ is a good setting. On each step of training, the gradient penalty loss is simply added to the InfoNCE loss, i.e. $\mathcal{L} = \mathcal{L}_{InfoNCE} + \mathcal{L}_{grad}$. Lastly, we note there are other approaches for improving the stability of Langevin-based training, such as loss functions with entropy regularization [du2020improved].
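A minimal sketch of the hinged penalty term, evaluated only on final-step chain samples (names are ours; a deep-learning framework would compute `grad_energy` by automatic differentiation):

```python
import numpy as np

def gradient_penalty(grad_energy, y_final_samples, margin=1.0):
    """Hinged penalty on the action-gradient magnitude: gradients with
    norm <= margin M are free; larger norms are penalized quadratically."""
    norms = np.array([np.linalg.norm(grad_energy(y)) for y in y_final_samples])
    return float(np.mean(np.maximum(0.0, norms - margin) ** 2))
```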
To aid intuition on why constraints on the gradients are allowable restrictions for the model, Corollary 1.1 shows that the energy model is capable of having an arbitrary Lipschitz constant.
B.4 Comparison of EBM Variants
A key comparison between these methods is the trade-off of simplicity against scaling to higher-dimensional action spaces. As shown in Fig. 11, with only 2,000 demonstrations in the $N$-D particle environment, the joint-dimensions-optimized derivative-free version (Sec. B.1) fails to solve the environment past 5 dimensions, due to the curse of dimensionality and its naive sampling. Both the autoregressive (Sec. B.2) and Langevin (Sec. B.3) versions are able to solve the environment reliably up to 16 dimensions, and with non-zero success at 32 dimensions. The autoregressive version requires no new gradient stabilization and uses the same loss function, $\mathcal{L}_{InfoNCE}$, but is memory-intensive, requiring $m$ separate models for $m$ action dimensions. The Langevin version scales to high dimensions with only one model, but requires gradient stabilization. For more on autoregressive and Langevin generative EBMs, see [nash2019autoregressive] and [du2019implicit, du2020improved]. Which variant is used for each of our evaluation tasks is enumerated in Section D.
Appendix C Additional Experimental Details and Analysis
C.1 Per-Task Summary of # Demonstrations and Environment Dimensionalities
In this section, with the table below, we highlight key aspects of the different evaluated policy learning experimental tasks, specifically the # of demonstrations for each task and the dimensionalities of the environments (comprised of the observation spaces, state spaces, and action spaces). As is highlighted in the table, the various tasks cover a wide set of challenges, including: low-data-regime tasks, and tasks with high observation, state, and/or action dimensionalities.
Demonstrations  Dimensionalities  

Domain  Task Name  #  Observations  States  Actions  Results Shown In  Comment 
D4RL Human-Experts  kitchen-complete  19  60  60  9  Table 2  
kitchen-partial  601  60  60  9  
kitchen-mixed  601  60  60  9  
pen-human  50  45  45  24  
hammer-human  25  46  46  26  
door-human  25  39  39  28  
relocate-human  25  39  39  30  
Particle Integrator  "1D"Particle  2,000  4  4  1  Figure 6  
"2D"Particle  2,000  8  8  2  
"3D"Particle  2,000  12  12  3  
"4D"Particle  2,000  16  16  4  
"5D"Particle  2,000  20  20  5  
"6D"Particle  2,000  24  24  6  
"8D"Particle  2,000  32  32  8  
"16D"Particle  2,000  64  64  16  
"32D"Particle  2,000  128  128  32  
Simulated Pushing  Single Target, States  2,000  10  10  2  Table 3  
Multi Target, States  2,000  13  13  2  
Single Target, Pixels  2,000  129,600  10  2  180x240x3 image  
Planar Sweeping  Image input  50  27,648  203  3  Table 4  96x96x3 image 
State input  50  203  203  3  
BiManual Sweeping  Imageandstate input  1,000  27,660  372  12  Table 5  96x96x3 image 
Real Robot  PushRedThenGreen  95  32,400  8  2  Table 6  90x120x3 image. 
PushRed/GreenMultimodal  410  32,400  8  2  
InsertBlue  223  32,400  8  2  
SortBlueFromYellow  502  32,400  26  2 
C.2 Training and Inference Times, Implicit vs. Explicit Comparison
D4RL Train+Eval Times. Table 6 compares example training + evaluation times for the chosen best-performing models on the D4RL tasks. We report the training steps/second, as well as the full time for running an experiment, which comprises training for 100k steps with intermittent evaluation of 100 episodes every 10k steps.
Implicit BC  Explicit BC  Comment  
Configuration  As in Section D.1  As in Section D.1  
Summary:  512 batch size  512 batch size  
512x8 MLP  2048x8 MLP  
100 Langevin iterations  
8 counter examples  
Device  TPUv3  TPUv3  
Task  door-human-v0  door-human-v0  
Training rate (steps/sec)  17.9  101.3  
Total train + eval time (hrs)  3.4  0.66  100k train steps, 100 evals every 10k steps 
As is shown in Table 6, the best-performing implicit models, which are 100-iteration Langevin models, take 5.6x the train+eval time of the best-performing explicit models. Note that even the 3.4-hour full train+eval time for the implicit model is considerably faster than the 16.3 hours reported [kostrikov2021offline] for completing a train+eval of CQL on a comparable D4RL task.
Real-World Image-Based Train and Inference Times. The following compares relevant training and inference times for our real-world tasks. In contrast to the D4RL scenario discussed above, in this scenario (a) there are large image observations to process, and (b) no simulated evaluations are run during training. We report the training steps/sec rate, as well as the total train time, which is measured on a server of 8 GPUs. Once trained, the model is then deployed on a single-GPU machine, for which we report the inference times.
Implicit BC  Explicit BC  Comment  
Configuration  As in Section D.5  As in Section D.5  
Summary:  128 batch size  128 batch size  
90x120 images  90x120 images  
4layer ConvMaxPool  4layer ConvMaxPool  
1024x4 MLP  1024x4 MLP  
256 counter examples  
Training Device  8x V100 GPU  8x V100 GPU  
Task  PushRedThenGreen  PushRedThenGreen  
Training rate (steps/sec)  4.7  5.5  
Total train time (hrs)  5.0  5.8  100k train steps 
Inference Device  1x RTX 2080 Ti GPU  1x RTX 2080 Ti GPU  
Inference parameters  1024 samples  
3 dfo iterations  
Inference time (ms)  7.22  3.49 
Table 7 shows that for these visual models, the training times are reasonably comparable for the implicit and explicit models: 5.0 and 5.8 hours respectively. In contrast to the D4RL scenario, this is because training time here is mostly dominated by visual processing. Since the implicit models use late fusion (Sec. E), their visual processing time is identical to the explicit models'. For inference, the chosen implicit models show a modest increase in inference time, up to 7.22 milliseconds (ms) from 3.49 ms for the explicit model. This can be attributed to time spent on the iterative derivative-free optimization. Note that the inference time of the implicit model can be adjusted by increasing or decreasing the number of samples and iterations. For example, using the same trained model but increasing the samples from 1024 to 2048 increases the inference time to 9.25 ms.
C.3 Additional Real-World Experimental Details
C.3.1 Robot Hardware Configuration, Workspace, and Objects
Our real-world experiments make use of a UFACTORY xArm6 robot arm, with all state logged at 100 Hz. Observations are recorded from an Intel RealSense D415 camera, using RGB-only images at 640x360 resolution, logged at 30 Hz. The cylindrical end-effector is made from a 6-inch-long plastic PVC pipe sourced from McMaster-Carr (9173K515). The work surface is a 24 x 18 inch smooth wooden cutting board. The manipulated objects are from the Play22 Baby Blocks Shape Sorter toy kit (Play22). The targets for the tasks were constructed by hand out of wood and spray-painted black. All demonstrations were provided in real time through a mouse-based teleoperation interface.
The 6-DOF robot is constrained to move in a 2D plane above the table. This aids the safety of the robot during operation, since it cannot collide with the table and cannot press objects down into the table.
C.3.2 Robot Policy and Controller
The learned visual-feedback policy operates at 5 Hz. On an RTX 2080 Ti GPU, the implicit models (configuration in Sec. D.5) complete inference in under 10 ms (see Sec. C.2), and so could be run faster than 5 Hz, but we find 5 Hz to be sufficient. The learned action space is a delta Cartesian setpoint, from the previous setpoint to the new one. The setpoints are linearly interpolated from their 5 Hz rate into 100 Hz setpoints for our joint-level controller. The joint-level controller uses PyBullet [coumans2016pybullet] for inverse kinematics, and sends joint positions to the xArm6 robot at 100 Hz.
C.4 Nearest-Neighbor Baseline
This baseline memorizes all training data, and performs inference by looking up the closest observation in the training set and returning the corresponding action. Specifically, given a finite training dataset of pairs $\{(x_i, y_i)\}_{i=1}^{n}$, denote the inputs as $X = (x_1, \dots, x_n)$ and the outputs as $Y = (y_1, \dots, y_n)$, preserving the ordering in both $X$ and $Y$. Given some new observation $x$, the Nearest-Neighbor model, $f_{NN}$, computes: $f_{NN}(x) = y_{i^*}$, where $i^* = \arg\min_i \|x - x_i\|$
for some norm $\|\cdot\|$; specifically, we used the L2 norm. We experimented with normalizing all observations per-dimension to unit variance, but did not find this to improve results. For environments with state-only observations (no images), we can compute this exactly and quickly, entirely in processor memory, but for the image-observation Simulated Pushing task we tested, the dataset did not fit in memory. Accordingly, we used a random linear projection, which is known to be a viable method for nearest-neighbor lookup of image data [bingham2001random], from the observation space to a 128-dimensional vector. We then stored all these 128-dimensional vectors in memory and used them for Nearest-Neighbor lookups.
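A minimal sketch of this baseline, with the random projection as an optional step (the names and the Gaussian projection scaling are our own choices, not the paper's code):

```python
import numpy as np

def nearest_neighbor_policy(observations, actions, proj_dim=None, seed=0):
    """Memorize the training set; at inference return the action paired with
    the L2-nearest stored observation. An optional random linear projection
    compresses large (e.g. image) observations before storage."""
    keys = observations.astype(float)
    proj = None
    if proj_dim is not None:
        rng = np.random.default_rng(seed)
        proj = rng.normal(size=(observations.shape[1], proj_dim)) / np.sqrt(proj_dim)
        keys = keys @ proj
    def policy(obs):
        q = obs.astype(float) @ proj if proj is not None else obs.astype(float)
        return actions[int(np.argmin(np.linalg.norm(keys - q, axis=1)))]
    return policy
```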
C.5 N-D Particle Environment Description
In this environment, the agent (i.e., a point particle) moves from its current configuration to a goal configuration $g_1$, followed by a second goal configuration $g_2$. Given its position $x$ and velocity $\dot{x}$, its action is the target position $y$ applied to a PD controller, which computes acceleration according to $\ddot{x} = K_p (y - x) + K_d (\dot{y} - \dot{x})$, where the target velocity is $\dot{y} = 0$, and $K_p$ and $K_d$ are environment-fixed constant gains. Initial and goal particle configurations are randomized, for each dimension, within a fixed range for each episode, and differ between training and testing. To generate demonstrations, a scripted policy returns $y = g_1$ until the agent falls within a small radius of $g_1$, then returns $y = g_2$ until the agent falls within a small radius of $g_2$. Agent state and goal positions are used as input to the policy, which is trained to imitate the behavior of the scripted policy and tested on its capacity to generalize to new goal configurations. This task can be thought of as modeling an $N$-dimensional step function while dealing with compounding errors. The mode switch between goals presents a discontinuity that must be learned.
C.6 Analysis: Training Data Sparsity in the N-D Particle Tasks
To complement other analyses on generalization, sample complexity, and interpolation/extrapolation, we analyze in Fig. 12 another notion of generalization: training data sparsity. In the $N$-D particle experiments, as we increase $N$ but hold the number of demonstrations constant, the training data effectively becomes much sparser over the observation space. New test-time environments for evaluation are accordingly, as $N$ increases, on average farther and farther from the training set. This helps explain why the Nearest-Neighbor baseline cannot solve this task well past 1D: memorizing the training data is insufficient, and to succeed in a higher-dimensional environment a model must generalize. This analysis complements our simple 1D→1D figures on extrapolation/interpolation (Fig. 2 and Fig. 3 in the main paper) and our visual generalization and sample complexity analysis (Fig. 4 in the main paper).
C.7 Additional D4RL Tasks
In the main paper we focused on the human-expert tasks from D4RL, but here we provide results on additional D4RL tasks as well. Note that the other tasks shown, except for ‘random’, use a reinforcement-learning-trained agent to generate the data, and this reinforcement-learning agent's policy is itself a unimodal, continuous, explicit function approximator, and was optimized as such. Additionally, as expected, supervised imitation learning methods, which do not make use of the additional reward information from the provided demonstrations, perform comparatively worse on tasks with suboptimal demonstrations; this is true of all tasks with “*medium*” and “*random*” in their task names. Additionally, as stated in Section D, we chose the EBM hyperparameters to maximize performance on the human-expert environments (“Franka” and “Adroit” tasks) at the expense of lower performance on the “Gym”-mujoco tasks. However, for fair comparison with other methods, and in accordance with the standard D4RL evaluation protocol, a single set of hyperparameters was used for all tasks, rather than presenting per-environment tuned results.
Baselines  Ours  
Explicit  Implicit  Explicit  Implicit  
Method  BC  CQL [kumar2020conservative]  S4RL [sinha2021s4rl]  BC (MSE)  BC (EBM)  BC (MSE)  BC (EBM)  
(from CQL)  w/ RWR [peters2007reinforcement]  w/ RWR [peters2007reinforcement]  
Uses reward data  —  ✓  ✓  —  —  ✓  ✓  
Domain  Task Name  
Franka  kitchen-complete  1.4  1.8  3.08  1.76 ± 0.04  3.37 ± 0.19  1.22 ± 0.18  3.37 ± 0.01 
kitchen-partial  1.4  1.9  2.99  1.69 ± 0.02  1.45 ± 0.35  1.86 ± 0.26  2.18 ± 0.05  
kitchen-mixed  1.9  2.0  2.15 ± 0.06  1.51 ± 0.39  2.03 ± 0.06  2.25 ± 0.14  
Adroit  pen-human  1121.9  1214.0  1419.6  2141 ± 109  2586 ± 65  2108 ± 58.8  2446 ± 207 
hammer-human  82.4  300.2  496.2  38 ± 25  133 ± 26  35.1 ± 45.1  9.3 ± 45.5  
door-human  41.7  234.3  736.5  79 ± 15  361 ± 67  17.9 ± 13.8  399 ± 34  
relocate-human  5.6  2.0  2.1  3.5 ± 1.1  0.1 ± 2.4  3.7 ± 0.3 
Gym  halfcheetah-medium  4202  5232  5778  4273  4086  
walker2d-medium  304  3637  4298  822  676  
hopper-medium  923  1867  2548  966  2430  
halfcheetah-medium-replay  4934  6101  4029  2766  
walker2d-medium-replay  970  1392  480  433  
hopper-medium-replay  940  1132  543  382  
halfcheetah-medium-expert  4164  7467  9528  11758  4040  
walker2d-medium-expert  520  4533  5152  640  745  
hopper-medium-expert  3621  3592  3674  909  876  
halfcheetah-expert  13004  12731  12802  9436  
walker2d-expert  5772  7067  2677  3746  
hopper-expert  3527  3557  3619  3549  
halfcheetah-random  118  4115  6213  0  392  
walker2d-random  33  323  1145  145  1.63  
hopper-random  308  331  331  284  308 
Appendix D Policy Learning Results Overview and Protocol
In each section below we describe the protocols for the individual simulation experiments. Note that Figure 5 was produced by averaging the performance of the best policies, for each type, within each domain across the different tasks of that domain.
Regarding which EBM variant was used for which task: Simulated Pushing and Real World, with action dimensionality of 2, used derivative-free optimization (Sec. B.1). For Planar Sweeping, with action dimensionality 3, and Bi-Manual Sweeping, with action dimensionality 12, we used autoregressive derivative-free optimization (Sec. B.2). D4RL, with action dimensionality between 3 and 30, used Langevin dynamics (Sec. B.3). Particle, with action dimensionality between 1 and 32, used Langevin dynamics as well. See Sec. B.4 for a comparison of variants.
D.1 D4RL Experiments
For D4RL experiments, we run sweeps over several hyperparameters for both the Implicit BC (EBM) and Explicit MSE-BC models. We choose the final hyperparameters based on maximum average performance over 3 D4RL environments: hammer-human-v0, door-human-v0, and relocate-human-v0. We use the same final hyperparameters across all D4RL tasks for the final results. Note that we paid closest attention to the human-teleoperation task performance when selecting a single set of hyperparameters for D4RL, particularly at the expense of slightly lower task performance on the gym-mujoco D4RL tasks. For all evaluations, we report average results over 100 episodes for 3 seeds. To calculate the aggregate D4RL performance metric “D4RL Human-Experts” in Figure 5 of the paper, we first calculated the normalized performance metric for the kitchen-complete, kitchen-partial, kitchen-mixed, pen-human, hammer-human, door-human, and relocate-human environments, then averaged across all these tasks.
The following hyperparameters were used for D4RL evaluation:
D4RL Implicit BC (EBM)
Hyperparameter  Chosen Value  Swept Values 

EBM variant  Langevin  
train iterations  100,000  
batch size  512  
learning rate  0.0005  
learning rate decay  0.99  
learning rate decay steps  100  
network size (width x depth)  512x8  128x32, 512x8 
activation  ReLU  swish, ReLU 
dense layer type  spectral norm  regular, spectral norm 
train counter examples  8  1, 8, 64 
action boundary buffer  0.05  0.001, 0.05 
gradient penalty  final step only  all steps, final step only 
gradient margin  1  0.6, 1.0, 1.3 
langevin iterations  100  100, 150 
langevin learning rate init.  0.5  2.0, 1.0, 0.5, 0.1 
langevin learning rate final  1.00E05  1e4, 1e5, 1e6 
langevin polynomial decay power  2  2.0, 1.0 
langevin delta action clip  0.5  0.05, 0.1, 0.5 
langevin noise scale  0.5  0.5, 1.0 
langevin 2nd iteration learning rate  1.00E05  1e1, 1e2, 1e5 
Also shown is an indication of training stability across 5 different seeds for the pen task.
D4RL Explicit MSEBC
Hyperparameter  Chosen Value  Swept Values 

train iterations  100,000  
batch size  512  
sequence length  2  
learning rate  0.001  1e3, 0.5e3 
learning rate decay  0.99  
learning rate decay steps  200  
dropout rate  0.1  0.0, 0.1 
network size (width x depth)  2048x8  128x16, 128x32, 512x16, 512x32, 1024x4, 1024x8, 2048x4, 2048x8 
activation  ReLU 
D.2 Simulated Pushing Experiments
For Simulated Pushing experiments, we run separate sweeps for each model for each of the States and Pixels versions of the task. All chosen hyperparameter sweeps and chosen values are given in tables below, and results are reported as the average of 100 episodes for 3 seeds.
Simulated Pushing, States, Implicit BC (EBM)
Hyperparameter  Chosen Value  Swept Values 

EBM variant  DFO  
train iterations  100,000  
batch size  512  
sequence length  2  2, 4 
learning rate  0.001  
learning rate decay  0.99  
learning rate decay steps  100  
network size (width x depth)  128x8  2048x4, 128x8, 128x16, 128x32 
activation  ReLU  
dense layer type  regular  
train counter examples  256  
action boundary buffer  0.05  
gradient penalty  none  
dfo samples  16384  
dfo iterations  3 
Simulated Pushing, States, Explicit MSEBC
Hyperparameter  Chosen Value  Swept Values 

train iterations  100,000  
batch size  512  
sequence length  2  
learning rate  0.0005  4e3, 2e3, 1e3, 0.5e3, 0.2e3 
learning rate decay  0.99  
learning rate decay steps  100  100, 150, 200, 400 
dropout rate  0.1  
network size (width x depth)  1024x8  1024x4, 1024x8, 2048x4, 2048x8 
activation  ReLU 
Simulated Pushing, States, Explicit MDNBC
Hyperparameter  Chosen Value  Swept Values 

train iterations  100,000  
batch size  512  
sequence length  2  
learning rate  0.001  
learning rate decay  0.99  
learning rate decay steps  100  
dropout rate  0.1  
network size (width x depth)  512x8  512x8, 512x16 
training temperature  1.0  0.5, 1.0, 2.0 
test temperature  1.0  0.5, 1.0, 2.0 
test variance exponent  1.0  1.0, 4.0 
Simulated Pushing, Pixels, Implicit BC (EBM)
Hyperparameter  Chosen Value  Swept Values 

EBM variant  DFO  
train iterations  100,000  
batch size  128  128, 256 
sequence length  2  
learning rate  0.001  
learning rate decay  0.99  
learning rate decay steps  100  
image size  240x180  120x90, 240x180 
MLP network size (width x depth)  1024x4  512x4, 1024x4, 256x14, 256x26, 1024x14, 1024x26 
Conv. Net.  4layer ConvMaxPool  
activation  ReLU  
dense layer type  regular  
train counter examples  256  
action boundary buffer  0.05  
gradient penalty  none  
dfo samples  4096  1024, 4096, 16384 
dfo iterations  3 
Simulated Pushing Pixels MSEBC
Hyperparameter  Chosen Value  Swept Values 

train iterations  100,000  
batch size  64  
sequence length  2  
learning rate  0.001  
learning rate decay  0.99  
learning rate decay steps  100  
image size  240x180  120x90, 240x180 
dropout rate (MLP only)  0.1  
network size (width x depth)  512x4  128x2, 128x4, 512x2, 512x4 
Conv. Net.  4layer ConvMaxPool  
activation  ReLU  
coord conv  True  True, False 
Simulated Pushing Pixels MDNBC
Hyperparameter  Chosen Value  Swept Values 

train iterations  100,000  
batch size  32  
sequence length  2  
learning rate  0.001  
learning rate decay  0.99  
learning rate decay steps  100  
dropout rate (MLP only)  0.1  
image size  120x90  120x90, 240x180 
network num components  26  
network size (width x depth)  512x8  512x8, 512x16 
Conv. Net.  4layer ConvMaxPool  
activation  ReLU  
training temperature  2.0  0.5, 1.0, 2.0 
test temperature  2.0  0.5, 1.0, 2.0 
test variance exponent  4.0  1.0, 4.0 
D.3 Simulated N-D Particle Environment Experiments
For a detailed description of this environment and its dynamics, see Section C.5. We used the following hyperparameters for evaluation on this environment:
Particle Implicit BC (EBM)
Hyperparameter  Chosen Value 

EBM variant  Langevin 
train iterations  50,000 
batch size  128 
sequence length  2 
learning rate  0.001 
learning rate decay  0.99 
learning rate decay steps  100 
network size (width x depth)  128x16 
activation  ReLU 
dense layer type  spectral norm 
train counter examples  64 
gradient penalty  final step only 
gradient margin  1 
langevin iterations  50 
langevin learning rate init.  0.1 
langevin learning rate final  1.00E05 
langevin polynomial decay power  2 
langevin delta action clip  0.1 
langevin noise scale  1.0 
langevin 2nd iteration learning rate  not used 
Particle Explicit MSEBC
Hyperparameter  Chosen Value 

train iterations  100,000 
batch size  512 
sequence length  2 
learning rate  0.001 
learning rate decay  0.99 
learning rate decay steps  200 
dropout rate  0.1 
network size (width x depth)  128x16 
activation  ReLU 
D.4 Simulated Sweeping Experiments
For Planar Sweeping, for both explicit and implicit models, results are shown for different types of encoders and different numbers of dense ResNet layers (Sec. E) in the table; each result is the average of 100 evaluations across 3 different seeds. The best models, for each of implicit and explicit, were taken from Planar Sweeping and evaluated on Bi-Manual Sweeping.
We used the following hyperparameters for evaluation on the simulated planar sweeping and bimanual sweeping environments:
Planar Sweeping Implicit BC (EBM)
Hyperparameter  Chosen Value  Swept Values 

EBM variant  Autoregressive  
train iterations  1,000,000  
batch size  64  
sequence length  2  
learning rate  1e4  1e3, 1e4 
Conv. Net.  ConvResNet  
# encoder features  64  
# Conv ResNet encoder layers  26  
# spatial softmax heads  64  8, 16, 32, 64 
# dense ResNet layers  20  8, 14, 20 
activation  ReLU  
train counter examples per action dim  1024  128, 256, 512, 1024 
inference examples per action dim  1024  128, 256, 512, 1024 
Planar Sweeping Explicit MSEBC
Hyperparameter  Chosen Value  Swept Values 

train iterations  1,000,000  
batch size  64  
sequence length  2  
learning rate  1e4  1e3, 1e4 
Conv. Net.  ConvResNet  
# encoder features  64  
# Conv ResNet encoder layers  26  
# spatial softmax heads  64  8, 16, 32, 64 
# dense ResNet layers  20  8, 14, 20 
activation  ReLU 
Bimanual Sweeping Implicit BC (EBM)
Hyperparameter  Chosen Value 

EBM variant  Autoregressive 
train iterations  1,000,000 
batch size  32 
sequence length  2 
learning rate  1e4 
Conv. Net.  ConvResNet 
# encoder features  64 
# Conv ResNet encoder layers  26 
# spatial softmax heads  64 
# dense ResNet layers  20 
activation  ReLU 
train counter examples per action dim  1024 
inference examples per action dim  1024 
Bimanual Sweeping Explicit MSEBC
Hyperparameter  Chosen Value 

train iterations  1,000,000 
batch size  32 
sequence length  2 
learning rate  1e4 
Conv. Net.  ConvResNet 
# encoder features  64 
# Conv ResNet encoder layers  26 
# spatial softmax heads  64 
# dense ResNet layers  20 
activation  ReLU 
D.5 Real-World Pushing Experiments
For Real World, explicit and implicit models were taken from Simulated Pushing, Pixels, and applied to the real world. We used the following hyperparameters for evaluation on the real-world pushing environments:
Realworld Tasks Pixels Implicit BC (EBM)
Hyperparameter  Pushing  Pushing Multimodal  Insertion  Sorting 

EBM variant  DFO  DFO  DFO  DFO 
train iterations  100,000  100,000  100,000  100,000 
batch size  128  256  256  256 
sequence length  2  2  2  2 
learning rate  0.001  0.001  0.001  0.001 
learning rate decay  0.99  0.99  0.99  0.99 
learning rate decay steps  100  100  100  100 
image size  120x90  120x90  120x90  120x90 
MLP network size (width x depth)  1024x4  1024x4  2048x4  1024x4 
Conv. Net.  4layer ConvMaxPool  4layer ConvMaxPool  4layer ConvMaxPool  4layer ConvMaxPool 
activation  ReLU  ReLU  ReLU  ReLU 
dense layer type  regular  regular  regular  regular 
train counter examples  256  256  256  256 
action boundary buffer  0.05  0.05  0.05  0.05 
gradient penalty  none  none  none  none 
dfo samples  1024  1024  2048  2048 
dfo iterations  3  3  3  3 
Pushing Pixels MSEBC
Hyperparameter  Pushing  Pushing Multimodal  Insertion  Sorting 

train iterations  100,000  100,000  100,000  100,000 
batch size  128  128  128  128 
sequence length  2  2  2  2 
learning rate  0.001  0.001  0.001  0.001 
learning rate decay  0.99  0.99  0.99  0.99 
learning rate decay steps  100  100  100  100 
image size  120x90  120x90  120x90  120x90 
dropout rate (MLP only)  0.1  0.1  0.1  0.1 
MLP network size (width x depth)  512x4  1024x4  1024x4  1024x4 
Conv. Net.  4layer ConvMaxPool  4layer ConvMaxPool  4layer ConvMaxPool  4layer ConvMaxPool 
activation  ReLU  ReLU  ReLU  ReLU 
Appendix E Model Architectures
E.1 MLPs
For non-image-observation models, we use MLPs (Multi-Layer Perceptrons) which, when used as EBMs (Fig. 13a), take in the actions and output a scalar energy in $\mathbb{R}$, or, when trained as MSE models, instead output the actions. All results shown used ReLU activations, although we experimented with Swish as well. Configurable model elements consisted of: Dropout [srivastava2014dropout], ResNet skip connections [he2016identity], and spectral-normalized dense layers instead of regular dense layers [miyato2018spectral].
E.2 ConvMLPs
For visuomotor models (Fig. 13b), we use the common ConvMLP [levine2016end]-style architecture, but when used as an EBM, we concatenate actions with image encodings from a CNN model. The MLP portion is identical to the section above. For the CNNs: for all models in the sweeping experiments, we used 26-layer ResNets [he2016deep] (“ConvResNets”), which maintain full-image spatial resolution before the encoder; for the simulated and real-world pushing experiments, we used a progressively-spatially-reduced model (“ConvMaxPool”) composed of convolutions interleaved with max-pooling, with feature dimensions [32, 64, 128, 256]. Both models used 3x3 convolution kernels. Configurable options include: using CoordConv [liu2018intriguing], i.e. a pixel coordinate map augmented as input, and either spatial soft (arg)max [levine2016end] or global average pooling encoders.
Appendix F Proofs
F.1 Definitions
A function $f$ is Lipschitz continuous with constant $L$ if $\|f(x_1) - f(x_2)\| \le L \|x_1 - x_2\|$ for all $x_1, x_2$. We say that $f$ is $L$-Lipschitz, so a 1-Lipschitz function is a function that is continuous with Lipschitz constant 1. The magnitude of the gradient of an $L$-Lipschitz function is always less than or equal to $L$.
The distance function from a point $x$ to a nonempty set $S$ is defined as:
$$d(x, S) = \inf_{s \in S} \|x - s\|$$
A closed set is a set that contains all of its boundary points (points that can be approached both from the interior and from the exterior of the set). Equivalently, a set is closed if and only if it contains all of its limit points (points that are the limit of some sequence of points in the set).
The power set of $Y$, denoted $\mathcal{P}(Y)$, is the set of all subsets of $Y$, including the empty set and all of $Y$.
The graph, $G$, of a function $f: X \to Y$ is the set of points:
$$G = \{(x, f(x)) : x \in X\}$$
The graph, $G$, of a multivalued function $F: X \to \mathcal{P}(Y)$ is the set of points:
$$G = \{(x, y) : x \in X,\ y \in F(x)\}$$
F.2 Proofs
Lemma 3.
The distance function $d(x, S)$ from any point $x$ to a nonempty set $S$ is well-defined and 1-Lipschitz.
Proof.
The distance function from a point $x$ to a nonempty set $S$ is defined as:
$$d(x, S) = \inf_{s \in S} \|x - s\|$$
The set of distance values $\{\|x - s\| : s \in S\}$ is a set of nonnegative real numbers bounded below by zero, so the infimum exists due to the completeness of $\mathbb{R}$. Therefore the distance function is well defined.
For any $x_1, x_2 \in X$ and any $s \in S$ (as pictured in Fig. 13(a)), the triangle inequality gives
$$\|x_1 - s\| \le \|x_1 - x_2\| + \|x_2 - s\|.$$
Since $d(x_1, S)$ is an infimum over $s \in S$, taking the infimum of both sides over $s$ yields
$$d(x_1, S) \le \|x_1 - x_2\| + d(x_2, S),$$
so $d(x_1, S) - d(x_2, S) \le \|x_1 - x_2\|$. Since $x_1$ and $x_2$ can be exchanged, we also have $d(x_2, S) - d(x_1, S) \le \|x_1 - x_2\|$, hence $|d(x_1, S) - d(x_2, S)| \le \|x_1 - x_2\|$ and thus $d(\cdot, S)$ is continuous over $X$ with a Lipschitz constant of 1. ∎
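Lemma 3 can also be checked numerically for a finite (hence closed) set; the sketch below empirically verifies the 1-Lipschitz bound on random point pairs (the helper name and set are our own):

```python
import numpy as np

def dist_to_set(x, S):
    """d(x, S) = min over s in S of ||x - s||, for a finite set S
    given as an (n, d) array of points."""
    return np.min(np.linalg.norm(S - x, axis=1))

rng = np.random.default_rng(1)
S = rng.standard_normal((50, 2))   # a finite (hence closed) set in R^2

# Empirically check |d(x1, S) - d(x2, S)| <= ||x1 - x2||.
for _ in range(200):
    x1, x2 = rng.standard_normal(2), rng.standard_normal(2)
    gap = abs(dist_to_set(x1, S) - dist_to_set(x2, S))
    assert gap <= np.linalg.norm(x1 - x2) + 1e-12
```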
Lemma 4.
If $d(x, S)$ is the distance function to a closed set $S$, then for every $x$ there exists an element $s^* \in S$ such that $d(x, S) = \|x - s^*\|$.
Proof.
Let $B$ be a closed ball of radius $d(x, S) + 1$ around $x$. The distance from $x$ to $S$ is equal to the distance from $x$ to $S \cap B$. Since $d(x, S)$ is defined as an infimum, there must exist an infinite sequence of points $s_i \in S \cap B$ with distances $\|x - s_i\|$ whose limit is $d(x, S)$. The set $S \cap B$ is closed and bounded and, therefore, compact. The infinite sequence must therefore have at least one subsequence that converges to a point $s^* \in S \cap B \subseteq S$. Since the distances of the full sequence converge to $d(x, S)$, we know that $\|x - s^*\| = d(x, S)$. ∎
Lemma 5.
For any continuous function $g: \mathbb{R}^m \to \mathbb{R}^n$, the distance to the graph of $g$ is a continuous function $E(x, y)$, such that $\arg\min_{y} E(x, y) = g(x)$ for all $x$.
Proof.
Let $E(x, y)$ be the distance in $\mathbb{R}^{m+n}$ from the point $(x, y)$ to the graph of $g$.
The graph, $G$, of the function $g$ is the set of points:
$$G = \{(x, g(x)) : x \in \mathbb{R}^m\}$$
Since the graph $G$ is a nonempty set, the distance function $E$ is well defined and continuous, as shown in Lemma 3.
We must still show that $\arg\min_{y} E(x, y) = g(x)$ for all $x$. We know that $E(x, y) \ge 0$, because $E$ is a distance function.
For any $x$, clearly $E(x, g(x)) = 0$, since the point $(x, g(x)) \in G$ and thus the distance from $(x, g(x))$ to a point in $G$ is zero.
Consider a point $(x, y)$ where $y \neq g(x)$ and therefore $(x, y) \notin G$. Since $g$ is continuous, $G$ is closed, and by Lemma 4 there will exist a point $(x', g(x')) \in G$ that achieves the infimum, $E(x, y) = \|(x, y) - (x', g(x'))\|$.
At least one of $x' \neq x$ or $g(x') \neq y$ holds, so $E(x, y) > 0$.
Therefore, for any $x$, $E(x, \cdot)$ achieves its unique minimum at $y = g(x)$ and thus $\arg\min_{y} E(x, y) = g(x)$. ∎
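A small numeric illustration of Lemma 5: taking $E(x, y)$ to be the distance to a densely sampled graph of a continuous $g$, the minimizer over $y$ recovers $g(x)$ up to discretization error (the sampling choices are our own):

```python
import numpy as np

g = np.sin                                # a continuous function R -> R
xs = np.linspace(-3, 3, 2001)
graph = np.stack([xs, g(xs)], axis=1)     # dense sampling of the graph of g

def E(x, y):
    """Distance from (x, y) to the (sampled) graph of g."""
    return np.min(np.linalg.norm(graph - np.array([x, y]), axis=1))

# At a query x0, E(x0, .) should be minimized (approximately) at y = g(x0).
x0 = 1.0
ys = np.linspace(-2, 2, 401)
y_star = ys[np.argmin([E(x0, y) for y in ys])]
assert abs(y_star - g(x0)) < 0.02
```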
We have shown that we can construct a continuous $E(x, y)$ that satisfies $\arg\min_{y} E(x, y) = g(x)$ for all $x$ if $g$ is single-valued and continuous. However, the functions we are modeling are often discontinuous or multivalued. If the single-valued function $g$ is discontinuous, there will be open boundaries on the graph where the point that minimizes the distance function is not in the graph of $g$ (Fig. 13(d)). In that example, there will be two values of $y$ that minimize $E(x, y)$ for the same value of $x$, in which case $\arg\min_{y} E(x, y)$ will not be well defined as a single-valued function. We can disambiguate the two cases to get a well-defined $E$, but we cannot reliably recover the original $g$ at the discontinuity.
In order to handle discontinuities and multivalued functions, we will extend the definition to allow functions that map to multiple values, $F: \mathbb{R}^m \to \mathcal{P}(\mathbb{R}^n)$. The multivalued function $F$ maps from $\mathbb{R}^m$ to the power set $\mathcal{P}(\mathbb{R}^n)$, which is the set of all subsets of $\mathbb{R}^n$, except the empty set. We no longer require continuity, but instead directly require the one important property of a continuous function that was used in the proof of Lemma 5, namely that the graph of $F$ is closed. In the simple case of a jump discontinuity (as in Fig. 13(d)), the function $F$ must include both sides of the discontinuity.
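A numeric sketch of this idea for a step function: including both endpoints of the jump makes the graph closed, and the distance-to-graph energy then attains zero at both branch values at the discontinuity (the sampling choices are our own):

```python
import numpy as np

# Closed graph of a step function with a jump at x = 0: both branch
# endpoints (0, 0) and (0, 1) are included, so the graph is closed.
xs_lo = np.linspace(-2, 0, 1001)
xs_hi = np.linspace(0, 2, 1001)
graph = np.concatenate([np.stack([xs_lo, np.zeros_like(xs_lo)], axis=1),
                        np.stack([xs_hi, np.ones_like(xs_hi)], axis=1)])

def E(x, y):
    """Distance from (x, y) to the closed graph of the step function."""
    return np.min(np.linalg.norm(graph - np.array([x, y]), axis=1))

# Away from the jump, E(x, .) has a unique zero; at x = 0 both y = 0 and
# y = 1 attain E = 0, i.e. the implicit model represents both branches.
assert E(-1.0, 0.0) < 1e-9 and E(1.0, 1.0) < 1e-9
assert E(0.0, 0.0) < 1e-9 and E(0.0, 1.0) < 1e-9
```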
Theorem 1.
For any multivalued (set-valued) function $F: \mathbb{R}^m \to \mathcal{P}(\mathbb{R}^n)$ where the graph $G$ of $F$ is closed, there exists a 1-Lipschitz function $E: \mathbb{R}^{m+n} \to \mathbb{R}$, such that $\arg\min_{y} E(x, y) = F(x)$ for all $x$.
Proof.
The graph, $G$, of the multivalued function $F$ is the set of points:
$$G = \{(x, y) : x \in \mathbb{R}^m,\ y \in F(x)\}$$
We can again define $E(x, y)$ as the distance to $G$. Because $G$ is a nonempty set, we know that $E$ is well-defined and uniformly continuous (Lemma 3).
We will now show that $E(x, y) = 0$ for all points $(x, y)$ in $G$ and $E(x, y) > 0$ for all points not in $G$.