State-Based Pushing Environment with Static Obstacles#


The StateBasedStaticObstaclePushingEnv is a simple pushing environment with two static obstacles. The aim is to push the object with the mover to the desired goal position without collisions by specifying either the jerk or the acceleration of the mover. A collision is detected, if the mover leaves the tiles (think of this as a collision with a wall) or if the mover or the object collide with one of the obstacles.

../_images/img_state_based_pusing_env_with_static_obstacles.png

The difficulty of the task can be adjusted by changing the size of the obstacles, i.e. the larger the obstacles, the greater the difficulty. The size of the obstacles can be changed by setting the obstacle_mode. The following modes are available:

  • simple:

../_images/img_state_based_static_obstacle_pushing_env_simple.png
  • medium:

../_images/img_state_based_static_obstacle_pushing_env_medium.png
  • hard:

../_images/img_state_based_static_obstacle_pushing_env_hard.png
  • random:

    The size of the obstacles is chosen uniformly at random. The minimum obstacle size is the one used in the obstacle_mode simple. The maximum obstacle size is the one used in the obstacle_mode hard.

  • curriculum:

    Increases the size of the obstacle using a curriculum. The curriculum is kept simple. The initial obstacle_mode is simple. After 800 episodes, the obstacle_mode is automatically set to medium. After an additional 800 episodes, the obstacle_mode is changed to hard for the remaining episodes.

The start positions of the mover and object as well as the object’s goal position are chosen randomly at the start of a new episode. This environment contains only one object and one mover. The tiles are arranged in a 4x3 layout. In this environment, positions, velocities, accelerations, and jerks have the units m, m/s, m/s² and m/s³, respectively.

Observation Space#

The observation space of this environment is a dictionary containing the following keys and values:

Key

Value

observation

  • if learn_jerk=True: a numpy array of shape (22,) containing the (x,y)-position, (x,y)-velocities, (x,y)-accelerations of the mover, as well as the x and y positions of the vertices of both obstacles

  • if learn_jerk=False: a numpy array of shape (20,) containing the (x,y)-position and (x,y)-velocities, (x,y)-accelerations of the mover, as well as the x and y positions of the vertices of both obstacles

achieved_goal

a numpy array of shape (2,) containing the current (x,y)-position of the object w.r.t. the base frame

desired_goal

a numpy array of shape (2,) containing the desired (x,y)-position of the object w.r.t. the base frame

Action Space#

The action space is continuous and 2-dimensional. If learn_jerk=True, an action

\[a_j = [a_{jx}, a_{jy}]^T \in [-j_{max},j_{max}]²\]

represents the desired jerks in x and y direction of the base frame (unit: m/s³), where j_max is the maximum possible jerk (see environment parameters).

Accordingly, if learn_jerk=False, an action

\[a_a = [a_{ax}, a_{ay}]^T \in [-a_{max},a_{max}]²\]

represents the accelerations in x and y direction of the base frame (unit: m/s²), where a_max is the maximum possible acceleration (see environment parameters).

Immediate Rewards#

The agent receives a reward of 0 if the object has reached its goal without a collision. For each timestep in which the object has not reached its goal and in which there is no collision, the environment emits a small negative reward of -1. In the case of a collision, the agent receives a large negative reward of -50.

Episode Termination and Truncation#

Each episode has a time limit of 25 environment steps. If the time limit is reached, the episode is truncated. Thus, each episode has 25 environment steps, except that the mover collides with a wall. In this case, the episode terminates immediately regardless of the time limit. An episode is not terminated when the object reaches its goal position, as the object has to remain close to the desired goal position until the end of an episode.

Note

The term ‘episode steps’ refers to the number of calls of env.steps(). The number of MuJoCo simulation steps, i.e. control cycles, is a multiple of the number of episode steps.

Environment Reset#

When the environment is reset, new (x,y) start positions for the mover and the object as well as a new goal position are chosen at random. It is ensured that the new start position of the mover is collision-free. In addition, the object’s start position is chosen such that the mover fits between the wall and the object. This is important to ensure that the object can be pushed in all directions.

Basic Usage#

The following example shows how to train an agent using Stable-Baselines3. To use the example, please install Stable-Baselines3 as described in the documentation.

Note

This is a simplified example that is not guaranteed to converge, as the default parameters are used. However, it is important to note that the parameter copy_info_dict is set to True. This way, it is not necessary to check for collision again to compute the reward when a transition is relabeled by HER, since the information is already available in the info-dict.

import numpy as np
import gymnasium as gym
from stable_baselines3 import SAC, HerReplayBuffer
import magbotsim

gym.register_envs(magbotsim)

render_mode = None
mover_params = {'size': np.array([0.113 / 2, 0.113 / 2, 0.012 / 2]), 'mass': 0.628}
collision_params = {'shape': 'box', 'size': np.array([0.113 / 2 + 1e-6, 0.113 / 2 + 1e-6]), 'offset': 0.0, 'offset_wall': 0.0}
env_params = {'mover_params': mover_params, 'collision_params': collision_params, 'render_mode': render_mode}

env = gym.make('StateBasedStaticObstaclePushingEnv-v0', **env_params)
# copy_info_dict=True, as information about collisions is stored in the info dictionary to avoid
# computationally expensive collision checking calculations when the data is relabeled (HER)
model = SAC(
    policy='MultiInputPolicy',
    env=env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'copy_info_dict': True},
    verbose=1
)
model.learn(total_timesteps=int(1e6))

Version History#

  • v0: initial version of the environment

Parameters#

class magbotsim.rl_envs.object_manipulation.pushing.state_based_static_obstacle_env.StateBasedStaticObstaclePushingEnv(obstacle_mode: str = 'simple', mover_params: dict[str, Any] | None = None, initial_mover_zpos: float = 0.002, std_noise: ndarray | float = 1e-05, render_mode: str | None = 'human', render_every_cycle: bool = False, num_cycles: int = 40, collision_params: dict[str, Any] | None = None, v_max: float = 2.0, a_max: float = 10.0, j_max: float = 100.0, learn_jerk: bool = False, threshold_pos: float = 0.05, use_mj_passive_viewer: bool = False)[source]

Bases: BasicMagBotSingleAgentEnv

A simple object pushing environment with two static obstacles.

Parameters:
  • obstacle_mode – the obstacle mode of this environment which determines the size of the obstacles and the difficulty of the task, defaults to ‘simple’. Can be ‘hard’ (large obstacles), ‘medium’ (medium-sized obstacles), ‘simple’ (small obstacles), ‘random’ (choose size of the obstacles at random), or ‘curriculum’ (increase the size of the obstacle using a curriculum).

  • mover_params

    Dictionary specifying mover properties. If None, default values are used. Supported keys:

    • mass (float | numpy.ndarray): Mass in kilograms. Options:
      • Single float: Same mass for all movers

      • 1D array (num_movers,): Individual masses per mover

    Default: 1.24 [kg]

    • shape (str | list[str]): Mover shape type. Must be one of:
      • ’box’: Rectangular cuboid

      • ’cylinder’: Cylindrical shape

      • ’mesh’: Custom 3D mesh

      Default: ‘box’

    • size (numpy.ndarray): Shape dimensions in meters. Format depends on shape:
      • For ‘box’: Half-sizes (x, y, z)

      • For ‘cylinder’: (radius, height, _)

      • For ‘mesh’: Computed from mesh dimensions multiplied by scale factors in mesh[‘scale’]

      Specification options: - 1D array (3,): Same size for all movers - 2D array (num_movers, 3): Individual sizes per mover

      Default: [0.155/2, 0.155/2, 0.012/2] [m]

    • mesh (dict): Configuration for mesh-based shapes. Required when shape=’mesh’. Contains:
      • mover_stl_path (str): Path to mover mesh STL file or one of the predefined meshes:
        • ’beckhoff_apm4330_mover’: Beckhoff APM4220 mover mesh (default)

        • ’beckhoff_apm4220_mover’: Beckhoff APM4220 mover mesh

        • ’beckhoff_apm4550_mover’: Beckhoff APM4550 mover mesh

        • ’planar_motor_M3-06’: Planar Motor M3-06 mover mesh

        • ’planar_motor_M3-15’: Planar Motor M3-15 mover mesh

        • ’planar_motor_M3-25’: Planar Motor M3-25 mover mesh

        • ’planar_motor_M4-11’: Planar Motor M4-11 mover mesh

        • ’planar_motor_M4-18’: Planar Motor M4-18 mover mesh

      • bumper_stl_path (str | None): Path to bumper mesh STL file or one of the predefined meshes:
        • ’beckhoff_apm4330_bumper’: Beckhoff APM4330 bumper mesh (default)

        • ’beckhoff_apm4220_bumper’: Beckhoff APM4220 bumper mesh

        • ’beckhoff_apm4550_bumper’: Beckhoff APM4550 bumper mesh

      • bumper_mass (float | numpy.ndarray): Bumper mass in kilograms. Can be specified as:
        • Single float: Same mass applied to all bumpers

        • 1D array (num_movers,): Individual masses for each bumper

        Default: 0.1 [kg]

      • scale (numpy.ndarray): Scale factors for mesh dimensions (x, y, z). Multiplied with the

        mesh geometry. Specification options: - 1D array (3,): Same scale factors applied to all movers - 2D array (num_movers, 3): Individual scale factors for each mover

        Default: [1.0, 1.0, 1.0] (no scaling)

    • material (str | list[str]): Material name to apply to the mover. Can be specified as:
      • Single string: Same material for all movers

      • List of strings: Individual materials for each mover

      Default: “gray” for movers without goals, color-coded materials for movers with goals

    Note: Custom mesh STL files must have their origin at the mover’s center.

  • initial_mover_zpos – the initial distance between the bottom of the mover and the top of a tile, defaults to 0.002 [m]

  • std_noise – the standard deviation of a Gaussian with zero mean used to add noise, defaults to 1e-5. The standard deviation can be used to add noise to the mover’s position, velocity and acceleration. If you want to use different standard deviations for position, velocity and acceleration use a numpy array of shape (3,); otherwise use a single float value, meaning the same standard deviation is used for all three values.

  • render_mode – the mode that is used to render the frames (‘human’, ‘rgb_array’ or None), defaults to ‘human’. If set to None, no viewer is initialized and used, i.e. no rendering. This can be useful to speed up training.

  • render_every_cycle – whether to call render() after each integrator step in the step() method, defaults to False. Rendering every cycle leads to a smoother visualization of the scene, but can also be computationally expensive. Thus, this parameter provides the possibility to speed up training and evaluation. Regardless of this parameter, the scene is always rendered after num_cycles have been executed if render_mode != None.

  • num_cycles – the number of control cycles for which to apply the same action, defaults to 40

  • collision_params – _description_, defaults to None

  • v_max – the maximum velocity, defaults to 2.0 [m/s]

  • a_max – the maximum acceleration, defaults to 10.0 [m/s²]

  • j_max – the maximum jerk (only used if learn_jerk=True), defaults to 100.0 [m/s³]

  • learn_jerk – whether to learn the jerk, defaults to False. If set to False, the acceleration is learned, i.e. the policy output.

  • threshold_pos – the position threshold used to determine whether a mover has reached its goal position, defaults to 0.05 [m]

  • use_mj_passive_viewer – whether the MuJoCo passive_viewer should be used, defaults to False. If set to False, the Gymnasium MuJoCo WindowViewer with custom overlays is used.

_after_mujoco_step_callback()[source]

Check whether corrective movements (overshoot or distance corrections) occurred and increase the corresponding counter if necessary.

_before_mujoco_step_callback(action: ndarray) None[source]

Apply the next action, i.e. it sets the jerk or acceleration, ensuring the minimum and maximum velocity and acceleration (for one cycle).

Parameters:

action – a numpy array of shape (num_movers * 2,), which specifies the next action (jerk or acceleration)

_check_for_other_collisions_callback()[source]

Check whether the object or mover collide with an obstacle.

Returns:

  • whether there is a collision (mover-obstacle or object-obstacle, bool)

  • a dictionary with keys obstacle_object_collision and obstacle_mover_collision that contains bool values indicating what type of collision occurred (mover-obstacle or object-obstacle)

_custom_xml_string_callback(custom_model_xml_strings: dict | None) dict[str, str][source]

For each mover, this callback adds actuators to the custom_model_xml_strings-dict, depending on whether the jerk or acceleration is the output of the policy.

Parameters:

custom_model_xml_strings – the current custom_model_xml_strings-dict which is modified by this callback

Returns:

the modified custom_model_xml_strings-dict

_reset_callback(options: dict[str, Any] | None = None) None[source]

Reset the start position of mover and object and the object goal position and reload the model. It is ensured that the new start position of the mover is collision-free, i.e. no wall collision and no collision with the object. In addition, the object’s start position is chosen such that the mover fits between the wall and the object. This is important to ensure that the object can be pushed in all directions.

It is also possible to explicitly specify start positions for mover and object, as well as the object’s goal position by using the keys ‘mover_xy_start_pos’, ‘object_xy_start_pos’, and ‘object_xy_goal_pos’ in the options-dict (each numpy arrays of shape (2,)). The start position of mover and object are sampled uniformly at random if the corresponding key is not in the options-dict. Note that there is no check whether mover or object collide with an obstacle due to computational efficiency. Since min and max x,y start positions of object and mover are chosen such that a collision with the obstacle cannot occur if the start positions are randomly sampled. In addition, if the object’s start position is specified by the user, the distance to the mover is not checked.

Parameters:

options – can be used to specify start positions for mover and object and a goal position for the object (keys: ‘mover_xy_start_pos’, ‘object_xy_start_pos’, ‘object_xy_goal_pos’)

close() None[source]

Close the environment.

compute_reward(achieved_goal: ndarray, desired_goal: ndarray, info: dict[str, Any] | None = None) ndarray | float[source]

Compute the immediate reward.

Parameters:
  • achieved_goal – a numpy array of shape (batch_size, length achieved_goal) or (length achieved_goal,) containing the already achieved (x,y)-positions of an object

  • desired_goal – a numpy array of shape (batch_size, length desired_goal) or (length desired_goal,) containing the (x,y) goal positions of an object

  • info – a dictionary containing auxiliary information, defaults to None

Returns:

a single float value or a numpy array of shape (batch_size,) containing the immediate rewards

compute_terminated(achieved_goal: ndarray, desired_goal: ndarray, info: dict[str, Any] | None = None) ndarray | bool[source]

Check whether a terminal state is reached. A state is terminal when the mover or object collides with a wall or the obstacle.

Parameters:
  • achieved_goal – a numpy array of shape (batch_size, length achieved_goal) or (length achieved_goal,) containing the already achieved (x,y)-positions of an object

  • desired_goal – a numpy array of shape (batch_size, length desired_goal) or (length desired_goal,) containing the (x,y) goal positions of an object

  • info – a dictionary containing auxiliary information, defaults to None

Returns:

  • if batch_size > 1:

    a numpy array of shape (batch_size,). An entry is True if the state is terminal, False otherwise

  • if batch_size = 1:

    True if the state is terminal, False otherwise

compute_truncated(achieved_goal: ndarray, desired_goal: ndarray, info: dict[str, Any] | None = None) ndarray | bool[source]

Check whether the truncation condition is satisfied. The truncation condition (a time limit in this environment) is automatically checked by the Gymnasium TimeLimit Wrapper, which is why this method always returns False.

Parameters:
  • achieved_goal – a numpy array of shape (batch_size, length achieved_goal) or (length achieved_goal,) containing the already achieved (x,y)-positions of an object

  • desired_goal – a numpy array of shape (batch_size, length desired_goal) or (length desired_goal,) containing the (x,y) goal positions of an object

  • info – a dictionary containing auxiliary information, defaults to None

Returns:

  • if batch_size > 1:

    a numpy array of shape (batch_size,) in which all entries are False

  • if batch_size = 1:

    False

ensure_max_dyn_val(current_values: ndarray, max_value: float, next_derivs: ndarray) tuple[ndarray, ndarray][source]

Ensure the minimum and maximum dynamic values.

Parameters:
  • current_values – the current velocity or acceleration specified as a numpy array of shape (2,) or (num_checks,2)

  • max_value – the maximum velocity or acceleration (float)

  • next_derivs – the next derivative (acceleration or jerk) used for one integrator step specified as a numpy array of shape (2,) or (num_checks,2)

Returns:

the next velocity or acceleration and the next derivative (acceleration or jerk) corresponding to the next action that must be applied to ensure the minimum and maximum dynamics (each of shape (num_checks,2))

get_current_num_distance_corrections() int[source]

Return the current number of distance corrections measured within the current episode.

Returns:

the current number of distance corrections measured within the current episode

get_current_num_overshoot_corrections() int[source]

Return the current number of overshoot corrections measured within the current episode.

Returns:

the current number of overshoot corrections measured within the current episode

reload_model(mover_start_xy_pos: ndarray | None = None) None[source]

Generate a new model XML string with new start positions for mover and object and a new object goal position and reload the model. In this environment, it is necessary to reload the model to ensure that the actuators work as expected.

Parameters:

mover_start_xy_pos – None or a numpy array of shape (num_movers,2) containing the (x,y) starting positions of each mover, defaults to None. If set to None, the movers will be placed in the center of the tiles that are added to the XML string first.

update_cached_actuator_mujoco_data() None[source]

Update all cached information about MuJoCo actuators, such as names and ids.