Skip to content

Communication Protocol

As introduced in the Quick Start section through the use of fake_port, the platform employs a socket-based communication mechanism to exchange structured data between the simulation controller and the inference model. You can extend the example in fake_port to integrate your own model controller for benchmark evaluation.

Outgoing Data

At each simulation timestep, the system sends the following structured data through a socket. Note that in practice, you might receive additional fields — those are deprecated and will be removed in future releases.

data = {
"camera_data": camera_data, // Dict, camera observations
"instruction": instruction, // String, task instruction
"joint_position_state": joint_position_state, // np.ndarray, current joint positions
"ee_pose_state": ee_pose_state, // list[np.ndarray] or list[list[np.ndarray]], end-effector pose(s)
"timestep": step, // Int, current simulation step
"reset": reset, // Bool, whether to reset the environment or model
}

Field Definitions

Field NameTypeDescription
camera_datadictData captured from the robot-mounted cameras.
instructionstringThe natural language command or task instruction given to the agent.
joint_position_statenp.ndarray (shape (9,))Current joint angles of the robot arm (e.g., Franka).
ee_pose_statelist[np.ndarray] or list[list[np.ndarray]]End-effector pose(s) of the robot in the robot frame. For single-arm robots, this is a list of two arrays — translation (3D) and orientation (4D, scalar-first quaternion). For dual-arm robots, it is a nested list for left and right arms, each with its own translation and orientation.
timestepintThe current simulation step index in the rollout.
resetboolWhether the environment or model should reset.

The model or controller must then respond with a structured action message.

Camera Data Structure

Each entry in the camera_data dictionary contains the following fields:

Field NameTypeDescription
pnp.ndarray (shape (3,))Camera position in world coordinates.
qnp.ndarray (shape (4,))Camera orientation in world coordinates (quaternion, scalar-first).
rgbnp.ndarrayRGB image array. The resolution depends on the configuration in configs/cameras/.
depthnp.ndarrayDepth image array.
intrinsics_matrixnp.ndarray (shape (3, 3))Camera intrinsic matrix.

Returned Action

The model must return an action dictionary, which can represent either joint positions or end-effector (ee) poses.

Joint Position Mode

Currently, delta joint positions are supported. If your model outputs absolute joint positions, you should convert them to deltas as shown below:

while True:
data = wait_message(receive_socket)
if data["reset"]:
last_joint_position = data["joint_position_state"]
processed_data = process_data(data)
action = model.inference(processed_data)
delta_joint_position[:7] = action[:7] - last_joint_position[:7]
last_joint_position = action
action = {"action": delta_joint_position}
send_message(send_socket, action)

Supported Formats

  • Franka with Panda Hand

    • Array of length 9
    • First 7 elements: delta arm joint positions
    • Last 2 elements: gripper control
    • [0.04, 0.04] = fully open, [0.0, 0.0] = fully closed
  • Franka with RoboTiq Hand

    • Array of length 13
    • First 7 elements: delta arm joint positions
    • Last 6 elements: gripper control
    • [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] = fully open
    • [0.7853, 0.7853, -0.7853, -0.7853, -0.7853, -0.7853] = fully closed
  • Aloha

    • Array of length 16
    • 0:6 and 8:14: delta joint positions of left and right arms
    • 6:8 and 14:16: gripper control
    • [0.05, 0.05] = fully open, [0.0, 0.0] = fully closed

End-Effector Pose Mode

In this mode, delta ee poses are supported. If your model outputs absolute poses, convert them into deltas accordingly.

Supported Formats

  • Franka with Panda Hand

    • Tuple of length 3:

      • 3D translation,
      • 4D quaternion orientation (scalar-first),
      • 2D gripper control ([0.04, 0.04] = open, [0.0, 0.0] = closed)
  • Franka with RoboTiq Hand

    • Tuple of length 3:

      • 3D translation,
      • 4D quaternion orientation (scalar-first),
      • 6D gripper control ([0.0,...] = open, [0.7853,...] = closed)
  • Aloha

    • Tuple of length 2: for left and right arms

    • Each arm’s tuple includes:

      • 3D translation,
      • 4D quaternion orientation (scalar-first),
      • 2D gripper control ([0.05, 0.05] = open, [0.0, 0.0] = closed)