State-Based Global Pushing Environment#

The StateBasedGlobalPushingEnv is an object manipulation environment designed for reinforcement learning with MagLev systems. This environment allows agents to learn pushing tasks with various object shapes and supports both position-only and pose learning modes. The term ‘global’ indicates that a global rather than a local motion planning strategy is learned.

The aim is to push an object with one or more movers to the desired goal position and/or orientation without collisions. A collision is detected, if any two movers collide or if at least one mover leaves the tiles (think of this as a collision with a wall). Starting positions of movers and objects, as well as object properties and goal configurations, are randomized at the start of each episode to ensure diverse training scenarios. The agent controls the movers by specifying either the jerk or acceleration. In this environment, positions, velocities, accelerations, and jerks have the units m, m/s, m/s² and m/s³, respectively.

Observation Space#

The observation space of this environment is a dictionary containing the following keys and values:

Key	Value
observation	if `learn_jerk=True`: a numpy array of shape (num_movers * 6,) containing the (x,y)-position, (x,y)-velocities and (x,y)-accelerations of each mover if `learn_jerk=False`: a numpy array of shape (num_movers * 4,) containing the (x,y)-position and (x,y)-velocities of each mover
achieved_goal	if `learn_pose=False`: a numpy array of shape (2,) containing the current (x,y)-position of the object if `learn_pose=True`: a numpy array of shape (4,) containing the current (x,y)-position and yaw orientation (sin, cos) of the object
desired_goal	if `learn_pose=False`: a numpy array of shape (2,) containing the desired (x,y)-position of the object if `learn_pose=True`: a numpy array of shape (4,) containing the desired (x,y)-position and yaw orientation (sin, cos) of the object

Action Space#

The action space is continuous and has dimensionality num_movers * 2. Actions directly specify the desired dynamic values within the configured limits.

If learn_jerk=True, an action

\[a_j = [a_{j1x}, a_{j1y}, ..., a_{jNx}, a_{jNy}]^T \in [-j_{max}, j_{max}]^{2N}\]

represents the desired jerks for N movers in x and y directions (unit: m/s³), where j_max is the maximum possible jerk.

If learn_jerk=False, an action

\[a_a = [a_{a1x}, a_{a1y}, ..., a_{aNx}, a_{aNy}]^T \in [-a_{max}, a_{max}]^{2N}\]

represents the desired accelerations for N movers in x and y directions (unit: m/s²), where a_max is the maximum possible acceleration.

Immediate Rewards#

The reward function is designed to encourage efficient object manipulation while penalizing collisions:

Success reward:
- if lean_pose=True: object_at_goal_reward given when the object reaches its goal position without any collisions (default: +1.0)
- if lean_pose=False: The coverage achieved in the current timestep without any collisions (object_at_goal_reward is ignored)
Collision penalty (collision_penalty, default: -10.0): Applied when a collision occurs
Step penalty (per_step_penalty, default: -0.01): Small negative reward applied at each timestep to encourage efficiency

Goal achievement criteria:

Position mode (learn_pose=False): Object position must be within max_position_err distance of the goal
Pose mode (learn_pose=True): Object must achieve at least min_coverage overlap ratio with the desired pose

Episode Termination and Truncation#

Episodes terminate under the following conditions:

Early termination: When early_termination_steps is set and the object remains at goal for the specified number of consecutive steps
Time limit: Episodes are truncated after reaching the maximum number of steps (typically managed by Gymnasium’s TimeLimit wrapper)
Collision: Episodes terminate immediately when a collision occurs.

Note

The term ‘episode steps’ refers to the number of calls to env.step(). Each step consists of num_cycles MuJoCo simulation steps.

Environment Reset#

When the environment is reset, the following randomization occurs:

Object properties: Type, dimensions, mass, and orientation can be sampled from configured ranges. For more information, please see the following section on object shapes.
Object placement: Start and goal positions are randomly chosen within valid boundaries
Mover placement: Initial positions are sampled ensuring collision-free placement
Collision avoidance: The system ensures sufficient distance between movers and objects, and between object start/goal positions

The reset process includes validation to ensure:

No initial collisions between movers and walls, i.e. each mover is above a tile
No initial collisions between movers and the object
Sufficient space for movers to maneuver around the object
Minimum distance between object start and goal positions

Object Shapes#

The environment supports six different object shapes:

square_box: Square box with equal width and height
box: Rectangular box with different width and height
cylinder: Cylindrical object
t_shape: T-shaped object composed of multiple box segments
l_shape: L-shaped object with vertical and horizontal segments
plus_shape: Plus/X-shaped object with perpendicular segments

../_images/img_objects_isometric.png — Overview of all supported object types.#

In the default configuration, all object parameters are fixed, and the default object is a square box. To use the environment with another object shape, set object_type to one of the object shapes listed above. These object shapes are parameterized with min and max values (min, max), and the environment automatically selects new object parameters at random if min < max applies for a parameter range. The parameter ranges can be changed by setting new values using object_ranges.

The following object parameters can be configured:

w: Width of the object (or object segments) in m (default: (0.06, 0.06))
h: Height of the object (or object segments) in m (default: (0.06, 0.06))
r: Radius for cylindrical objects in m (default: (0.02, 0.02))
s: Segment width for composite shapes (T, L, plus) in m (default: (0.02, 0.02))
d: Depth parameter for segment positioning in composite shapes in m (default: (0.05, 0.05))
mass: Mass of the object in kg (default: (0.3, 0.3))

../_images/img_object_geometry_labeled.png

Basic Usage#

The following example shows how to train an agent using Stable-Baselines3. To use the example, please install Stable-Baselines3 as described in the documentation.

Note

This is a simplified example that is not guaranteed to converge, as the default parameters are used. However, it is important to note that the parameter copy_info_dict is set to True. This way, it is not necessary to check for collision again to compute the reward when a transition is relabeled by HER, since the information is already available in the info-dict.

import numpy as np
import gymnasium as gym
from stable_baselines3 import SAC, HerReplayBuffer
import magbotsim

gym.register_envs(magbotsim)

render_mode = None
mover_params = {'size': np.array([0.113 / 2, 0.113 / 2, 0.012 / 2]), 'mass': 0.628}
collision_params = {'shape': 'box', 'size': np.array([0.113 / 2 + 1e-6, 0.113 / 2 + 1e-6]), 'offset': 0.0, 'offset_wall': 0.0}
env_params = {'mover_params': mover_params, 'collision_params': collision_params, 'render_mode': render_mode}

env = gym.make('StateBasedObjectPushingEnv-v0', **env_params)
# copy_info_dict=True, as information about collisions is stored in the info dictionary to avoid
# computationally expensive collision checking calculations when the data is relabeled (HER)
model = SAC(
    policy='MultiInputPolicy',
    env=env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'copy_info_dict': True},
    verbose=1
)
model.learn(total_timesteps=int(1e6))

Version History#

v0: initial version of the environment

Parameters#

class magbotsim.rl_envs.object_manipulation.pushing.state_based_global_pushing_env.StateBasedGlobalPushingEnv(num_movers: int = 1, mover_params: dict[str, Any] | None = None, layout_tiles: ndarray | None = None, initial_mover_zpos: float = 0.003, std_noise: ndarray | float = 1e-05, render_mode: str | None = 'human', render_every_cycle: bool = False, num_cycles: int = 40, collision_params: dict[str, Any] | None = None, object_type: str = 'square_box', object_ranges: MutableMapping[str, tuple[float, float]] = {'depth': (0.02, 0.02), 'height': (0.06, 0.06), 'mass': (0.3, 0.3), 'radius': (0.05, 0.05), 'segment_width': (0.02, 0.02), 'width': (0.06, 0.06)}, v_max: float = 2.0, a_max: float = 10.0, j_max: float = 100.0, object_sliding_friction: float = 1.0, object_torsional_friction: float = 0.005, learn_jerk: bool = False, learn_pose: bool = False, early_termination_steps: int | None = None, max_position_err: float = 0.05, min_coverage: float = 0.9, collision_penalty: float = -10.0, per_step_penalty: float = -0.01, object_at_goal_reward: float = 1.0, use_mj_passive_viewer: bool = False)[source]

Bases: BasicMagBotSingleAgentEnv

An object pushing environment.

Parameters:

num_movers – the number of movers in the environment, defaults to 1
mover_params –
a dictionary that can be used to specify the mass and size of each mover using the keys ‘mass’ or ‘size’, defaults to None. To use the same mass and size for each mover, the mass can be specified as a single float value and the size as a numpy array of shape (3,). However, the movers can also be of different types, i.e. different masses and sizes. In this case, the mass and size should be specified as numpy arrays of shapes (num_movers,) and (num_movers,3), respectively. If set to None or only one key is specified, both mass and size or the missing value are set to the following default values:
- mass: 1.24 [kg]
- size: [0.155/2, 0.155/2, 0.012/2] (x,y,z) [m] (note: half-size)
layout_tiles – a numpy array of shape (height, width) that specifies the layout of the tiles, defaults to None. If None, a 4x3 grid of tiles is used.
initial_mover_zpos – the initial distance between the bottom of the mover and the top of a tile, defaults to 0.003
std_noise – the standard deviation of a Gaussian with zero mean used to add noise, defaults to 1e-5. The standard deviation can be used to add noise to the mover’s position, velocity and acceleration. If you want to use different standard deviations for position, velocity and acceleration use a numpy array of shape (3,); otherwise use a single float value, meaning the same standard deviation is used for all three values.
render_mode – the mode that is used to render the frames (‘human’, ‘rgb_array’ or None), defaults to ‘human’. If set to None, no viewer is initialized and used, i.e. no rendering. This can be useful to speed up training.
render_every_cycle – whether to call ‘render’ after each integrator step in the step() method, defaults to False. Rendering every cycle leads to a smoother visualization of the scene, but can also be computationally expensive. Thus, this parameter provides the possibility to speed up training and evaluation. Regardless of this parameter, the scene is always rendered after ‘num_cycles’ have been executed if render_mode != None.
num_cycles – the number of control cycles for which to apply the same action, defaults to 40
collision_params –
a dictionary that can be used to specify the following collision parameters, defaults to None:
- collision shape (key: ‘shape’): can be ‘box’ or ‘circle’, defaults to ‘circle’
- size of the collision shape (key: ‘size’), defaults to 0.11 [m]:
  - collision shape ‘circle’:
    a single float value which corresponds to the radius of the circle, or a numpy array of shape (num_movers,) to specify individual values for each mover
  - collision shape ‘box’:
    a numpy array of shape (2,) to specify x and y half-size of the box, or a numpy array of shape (num_movers, 2) to specify individual sizes for each mover
- additional size offset (key: ‘offset’), defaults to 0.0 [m]: an additional safety offset that is added to the size of the
  collision shape. Think of this offset as increasing the size of a mover by a safety margin.
- additional wall offset (key: ‘offset_wall’), defaults to 0.0 [m]: an additional safety offset that is added to the size
  of the collision shape to detect wall collisions. Think of this offset as moving the wall, i.e. the edge of a tile without an adjacent tile, closer to the center of the tile.
object_types – a sequence of strings that specifies which object types can be used, defaults to DEFAULT_OBJECT_TYPES. Valid object types are: ‘square_box’, ‘box’, ‘cylinder’, ‘t_shape’, ‘l_shape’, ‘plus_shape’
object_ranges – a dictionary that specifies the range of object parameters for random sampling, defaults to DEFAULT_OBJECT_RANGES. Keys can include ‘width’, ‘height’, ‘depth’, ‘segment_width’, ‘radius’, ‘mass’. Each value is a tuple (min_value, max_value)
object_sliding_friction – the sliding friction coefficient of the object, defaults to 1.0
object_torsional_friction – the torsional friction coefficient of the object, defaults to 0.005
v_max – the maximum velocity, defaults to 2.0 [m/s]
a_max – the maximum acceleration, defaults to 10.0 [m/s²]
j_max – the maximum jerk (only used if ‘learn_jerk=True’), defaults to 100.0 [m/s³]
learn_jerk – whether to learn the jerk, defaults to False. If set to False, the acceleration is learned, i.e. the policy output.
learn_pose – whether to learn both position and orientation of the object, defaults to False. If set to False, only position is learned.
early_termination_steps – the number of consecutive steps at goal after which the episode terminates early, defaults to None (no early termination)
max_position_err – the position threshold used to determine whether the object has reached its goal position, defaults to 0.05 [m]
min_coverage – the minimum coverage ratio for goal achievement when learn_pose=True, defaults to 0.9
collision_penalty – the reward penalty applied when a collision occurs, defaults to -10.0
per_step_penalty – the small negative reward applied at each time step to encourage efficiency, defaults to -0.01
object_at_goal_reward – the positive reward given when the object reaches the goal without collisions, defaults to 1.0
use_mj_passive_viewer – whether the MuJoCo passive_viewer should be used, defaults to False. If set to False, the Gymnasium MuJoCo WindowViewer with custom overlays is used.

_after_mujoco_step_callback()[source]: Check whether corrective movements (overshoot or distance corrections) occurred and increase the corresponding counter if necessary.

_before_mujoco_step_callback(action: ndarray) → None[source]

Apply the next action, i.e. it sets the jerk or acceleration, ensuring the minimum and maximum velocity and acceleration (for one cycle).

Parameters:: action – a numpy array of shape (num_movers * 2,), which specifies the next action (jerk or acceleration)

_custom_xml_string_callback(custom_model_xml_strings: dict | None) → dict[str, str][source]

Generate custom XML strings for the environment model.

This method is called during model initialization to add custom objects and modifications to the base MuJoCo model.

Returns:: a dict of XML strings to be inserted into the model

_reset_callback(options: dict[str, Any] | None = None) → None[source]

Reset the start positions of the movers and object and the object goal position and reload the model. It is ensured that the new start positions of the movers are collision-free, i.e. no wall collision, no collision with other movers and no collision with the object. In addition, the object’s start position is chosen such that the mover fits between the wall and the object. This is important to ensure that the object can be pushed in all directions.

Parameters:: options – can be used to override mover and object parameters

close() → None[source]: Close the environment.

compute_reward(achieved_goal: ndarray, desired_goal: ndarray, info: dict[str, Any] | None = None) → ndarray | float[source]

Compute the immediate reward.

Parameters:

achieved_goal – a numpy array of shape (batch_size, length achieved_goal) or (length achieved_goal,) containing the already achieved (x,y)-positions of an object
desired_goal – a numpy array of shape (batch_size, length desired_goal) or (length desired_goal,) containing the (x,y) goal positions of an object
info – a dictionary containing auxiliary information, defaults to None

Returns:

a single float value or a numpy array of shape (batch_size,) containing the immediate rewards

compute_terminated(achieved_goal: ndarray, desired_goal: ndarray, info: dict[str, Any] | None = None) → ndarray | bool[source]

Compute whether the episode should terminate.

The episode terminates if there is a collision between movers, a collision with a wall, or if early termination conditions are met (when configured).

Parameters:

achieved_goal – a numpy array of shape (batch_size, length achieved_goal) or (length achieved_goal,) containing the already achieved (x,y)-positions of an object
desired_goal – a numpy array of shape (batch_size, length desired_goal) or (length desired_goal,) containing the (x,y) goal positions of an object
info – a dictionary containing auxiliary information, defaults to None

Returns:

if batch_size > 1:
a numpy array of shape (batch_size,). An entry is True if the state is terminal, False otherwise
if batch_size = 1:
True if the state is terminal, False otherwise

compute_truncated(achieved_goal: ndarray, desired_goal: ndarray, info: dict[str, Any] | None = None) → ndarray | bool[source]

Check whether the truncation condition is satisfied. The truncation condition (a time limit in this environment) is automatically checked by the Gymnasium TimeLimit Wrapper, which is why this method always returns False.

Parameters:

achieved_goal – a numpy array of shape (batch_size, length achieved_goal) or (length achieved_goal,) containing the already achieved (x,y)-positions of an object
desired_goal – a numpy array of shape (batch_size, length desired_goal) or (length desired_goal,) containing the (x,y) goal positions of an object
info – a dictionary containing auxiliary information, defaults to None

Returns:

if batch_size > 1:
a numpy array of shape (batch_size,) in which all entries are False
if batch_size = 1:
False

ensure_max_dyn_val(current_values: ndarray, max_value: float, next_derivs: ndarray) → tuple[ndarray, ndarray][source]

Ensure the minimum and maximum dynamic values.

Parameters:

current_values – the current velocity or acceleration specified as a numpy array of shape (2,) or (num_checks,2)
max_value – the maximum velocity or acceleration (float)
next_derivs – the next derivative (acceleration or jerk) used for one integrator step specified as a numpy array of shape (2,) or (num_checks,2)

Returns:

the next velocity or acceleration and the next derivative (acceleration or jerk) corresponding to the next action that must be applied to ensure the minimum and maximum dynamics (each of shape (num_checks,2))

get_current_num_distance_corrections() → int[source]

Return the current number of distance corrections measured within the current episode.

Returns:: the current number of distance corrections measured within the current episode

get_current_num_overshoot_corrections() → int[source]

Return the current number of overshoot corrections measured within the current episode.

Returns:: the current number of overshoot corrections measured within the current episode

reload_model(mover_start_xy_pos: ndarray | None = None) → None[source]

Generate a new model XML string with new start positions for mover and object and a new object goal position and reload the model. In this environment, it is necessary to reload the model to ensure that the actuators work as expected.

Parameters:: mover_start_xy_pos – None or a numpy array of shape (num_movers,2) containing the (x,y) starting positions of each mover, defaults to None. If set to None, the movers will be placed in the center of the tiles that are added to the XML string first.

update_cached_actuator_mujoco_data() → None[source]: Update all cached information about MuJoCo actuators, such as names and ids.