Communication Protocol

As the fake_port example in the Quick Start section shows, the platform uses a socket-based mechanism to exchange structured data between the simulation controller and the inference model. You can extend that example to integrate your own model controller for benchmark evaluation.

Outgoing Data

At each simulation timestep, the system sends the following structured data through a socket. In practice you may receive additional fields; these are deprecated and will be removed in future releases, so do not rely on them.

data = {
    "camera_data": camera_data,  # dict, camera observations
    "instruction": instruction,  # str, task instruction
    "joint_position_state": joint_position_state,  # np.ndarray, current joint positions
    "ee_pose_state": ee_pose_state,  # list[np.ndarray] or list[list[np.ndarray]], end-effector pose(s)
    "timestep": step,  # int, current simulation step
    "reset": reset,  # bool, whether to reset the environment or model
}

Field Definitions

Field Name	Type	Description
`camera_data`	dict	Data captured from the robot-mounted cameras.
`instruction`	string	The natural language command or task instruction given to the agent.
`joint_position_state`	`np.ndarray` (shape `(9,)`)	Current joint angles of the robot arm (e.g., Franka).
`ee_pose_state`	list[`np.ndarray`] or list[list[`np.ndarray`]]	End-effector pose(s) of the robot in the robot frame. For single-arm robots, this is a list of two arrays: translation (3D) and orientation (4D, scalar-first quaternion). For dual-arm robots, it is a nested list with one entry each for the left and right arms, each containing its own translation and orientation.
`timestep`	int	The current simulation step index in the rollout.
`reset`	bool	Whether the environment or model should reset.
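
Because `ee_pose_state` is flat for single-arm robots and nested for dual-arm robots, it can help to normalize it before passing it to a model. The following is a minimal sketch, not part of the platform: the helper name `unpack_ee_pose_state` is hypothetical, and it assumes the layout described in the table above.

```python
import numpy as np


def unpack_ee_pose_state(ee_pose_state):
    """Return a list of (translation, quaternion) pairs, one pair per arm.

    Hypothetical helper. Assumes the documented layout:
      single arm: [translation(3,), quaternion(4,)]
      dual arm:   [[translation, quaternion], [translation, quaternion]]
    """
    # For a single arm, ee_pose_state[0] is the translation, so its first
    # element is a scalar. For dual arms, it is the left arm's translation array.
    if np.ndim(ee_pose_state[0][0]) == 0:
        arms = [ee_pose_state]
    else:
        arms = list(ee_pose_state)

    poses = []
    for translation, quaternion in arms:
        translation = np.asarray(translation, dtype=float)
        quaternion = np.asarray(quaternion, dtype=float)  # scalar-first (w, x, y, z)
        if translation.shape != (3,) or quaternion.shape != (4,):
            raise ValueError("unexpected end-effector pose shape")
        poses.append((translation, quaternion))
    return poses
```

With this, downstream code can iterate over arms the same way regardless of the robot type.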

The model or controller must then respond with a structured action message.

Camera Data Structure

Each entry in the camera_data dictionary contains the following fields:

Field Name	Type	Description
`p`	`np.ndarray` (shape `(3,)`)	Camera position in world coordinates.
`q`	`np.ndarray` (shape `(4,)`)	Camera orientation in world coordinates (quaternion, scalar-first).
`rgb`	`np.ndarray`	RGB image array. The resolution depends on the configuration in `configs/cameras/`.
`depth`	`np.ndarray`	Depth image array.
`intrinsics_matrix`	`np.ndarray` (shape `(3, 3)`)	Camera intrinsic matrix.
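
The camera position `p` and scalar-first quaternion `q` can be combined into a single 4x4 homogeneous transform. The sketch below is illustrative only: the function names are hypothetical, and it assumes `p` and `q` describe the camera frame's pose in the world frame (a camera-to-world transform). The camera axis convention is not stated here, so check it against your configuration.

```python
import numpy as np


def quat_wxyz_to_matrix(q):
    """Convert a scalar-first quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)  # normalize to guard against drift
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def camera_to_world(p, q):
    """Build a 4x4 transform from camera position p and orientation q.

    Assumption: (p, q) is the pose of the camera frame expressed in the world frame.
    """
    transform = np.eye(4)
    transform[:3, :3] = quat_wxyz_to_matrix(q)
    transform[:3, 3] = np.asarray(p, dtype=float)
    return transform
```

For each camera entry `cam` in `camera_data`, this would be called as `camera_to_world(cam["p"], cam["q"])`.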

Returned Action

The model must return an action dictionary, which can represent either joint positions or end-effector (ee) poses.

Joint Position Mode

Currently, the platform accepts delta joint positions. If your model outputs absolute joint positions, convert them to deltas as shown below:

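import numpy as np

# wait_message, send_message, process_data, and model are placeholders
# for your own socket helpers and model wrapper.
last_joint_position = None
while True:
    data = wait_message(receive_socket)
    # On reset (or on the very first message), re-anchor to the measured joint state.
    if data["reset"] or last_joint_position is None:
        last_joint_position = np.asarray(data["joint_position_state"], dtype=float)
    processed_data = process_data(data)
    # Model output: absolute joint targets, arm joints first, gripper values last.
    action = np.asarray(model.inference(processed_data), dtype=float)
    # Only the 7 arm joints become deltas; the gripper entries are assumed to pass through unchanged.
    delta_joint_position = action.copy()
    delta_joint_position[:7] = action[:7] - last_joint_position[:7]
    last_joint_position = action
    send_message(send_socket, {"action": delta_joint_position})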

Supported Formats

Franka with Panda Hand
- Array of length 9
- First 7 elements: delta arm joint positions
- Last 2 elements: gripper control
- [0.04, 0.04] = fully open, [0.0, 0.0] = fully closed
Franka with RoboTiq Hand
- Array of length 13
- First 7 elements: delta arm joint positions
- Last 6 elements: gripper control
- [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] = fully open
- [0.7853, 0.7853, -0.7853, -0.7853, -0.7853, -0.7853] = fully closed
Aloha
- Array of length 16
- 0:6 and 8:14: delta joint positions of left and right arms
- 6:8 and 14:16: gripper control
- [0.05, 0.05] = fully open, [0.0, 0.0] = fully closed
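
The layouts above can be assembled with a few small helpers. This is a sketch under assumptions: the function and constant names are hypothetical, the `"action"` key follows the joint-position example earlier in this section, and the gripper is simplified to a binary open/closed command using the values listed above.

```python
import numpy as np

# Gripper values taken from the format lists above.
PANDA_OPEN, PANDA_CLOSED = np.array([0.04, 0.04]), np.array([0.0, 0.0])
ALOHA_OPEN, ALOHA_CLOSED = np.array([0.05, 0.05]), np.array([0.0, 0.0])


def panda_joint_action(delta_arm, gripper_open):
    """Franka with Panda Hand: 7 arm deltas followed by 2 gripper values."""
    delta_arm = np.asarray(delta_arm, dtype=float)
    if delta_arm.shape != (7,):
        raise ValueError("expected 7 delta arm joint positions")
    gripper = PANDA_OPEN if gripper_open else PANDA_CLOSED
    return {"action": np.concatenate([delta_arm, gripper])}


def aloha_joint_action(delta_left, delta_right, left_open, right_open):
    """Aloha: [left arm 0:6, left gripper 6:8, right arm 8:14, right gripper 14:16]."""
    delta_left = np.asarray(delta_left, dtype=float)
    delta_right = np.asarray(delta_right, dtype=float)
    if delta_left.shape != (6,) or delta_right.shape != (6,):
        raise ValueError("expected 6 delta joint positions per arm")
    action = np.zeros(16)
    action[0:6] = delta_left
    action[6:8] = ALOHA_OPEN if left_open else ALOHA_CLOSED
    action[8:14] = delta_right
    action[14:16] = ALOHA_OPEN if right_open else ALOHA_CLOSED
    return {"action": action}
```

A RoboTiq variant would follow the same pattern with a 6-element gripper vector, giving a length-13 array.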

End-Effector Pose Mode

In this mode, delta ee poses are supported. If your model outputs absolute poses, convert them into deltas accordingly.
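
This section does not define how a delta pose is composed, so the following is only one plausible sketch. It assumes the delta translation is `target - current` in the robot frame and the delta rotation is the quaternion that rotates the current orientation onto the target (`q_target * conj(q_current)`, scalar-first). All function names are hypothetical; verify the convention against the platform before relying on it.

```python
import numpy as np


def quat_multiply(a, b):
    """Hamilton product of two scalar-first quaternions (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])


def quat_conjugate(q):
    """Conjugate of a scalar-first quaternion (the inverse for unit quaternions)."""
    w, x, y, z = q
    return np.array([w, -x, -y, -z])


def absolute_to_delta_pose(current_t, current_q, target_t, target_q):
    """Assumed convention: delta_t = target - current, delta_q = q_target * conj(q_current)."""
    current_t = np.asarray(current_t, dtype=float)
    target_t = np.asarray(target_t, dtype=float)
    current_q = np.asarray(current_q, dtype=float)
    target_q = np.asarray(target_q, dtype=float)
    delta_t = target_t - current_t
    delta_q = quat_multiply(target_q, quat_conjugate(current_q))
    delta_q = delta_q / np.linalg.norm(delta_q)  # keep it a unit quaternion
    return delta_t, delta_q
```

Under this convention, applying the delta means computing `quat_multiply(delta_q, current_q)` and `current_t + delta_t`, which recovers the target pose.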

Supported Formats

Franka with Panda Hand
- Tuple of length 3:
  - 3D translation,
  - 4D quaternion orientation (scalar-first),
  - 2D gripper control ([0.04, 0.04] = open, [0.0, 0.0] = closed)
Franka with RoboTiq Hand
- Tuple of length 3:
  - 3D translation,
  - 4D quaternion orientation (scalar-first),
  - 6D gripper control ([0.0,...] = open, [0.7853,...] = closed)
Aloha
- Tuple of length 2: for left and right arms
- Each arm’s tuple includes:
  - 3D translation,
  - 4D quaternion orientation (scalar-first),
  - 2D gripper control ([0.05, 0.05] = open, [0.0, 0.0] = closed)
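
As a rough illustration of the nesting described above, the helpers below assemble the end-effector payload for the Panda and Aloha cases. The names are hypothetical, and the sketch only builds the tuple itself. How it is wrapped for sending (for example under an `"action"` key, as in joint position mode) is not specified in this section and should be checked against the fake_port example.

```python
import numpy as np


def ee_arm_action(delta_translation, delta_quaternion, gripper):
    """One arm: (3D translation, 4D scalar-first quaternion, gripper control)."""
    delta_translation = np.asarray(delta_translation, dtype=float)
    delta_quaternion = np.asarray(delta_quaternion, dtype=float)
    gripper = np.asarray(gripper, dtype=float)
    if delta_translation.shape != (3,) or delta_quaternion.shape != (4,):
        raise ValueError("expected a 3D translation and a 4D quaternion")
    return (delta_translation, delta_quaternion, gripper)


def aloha_ee_action(left_arm, right_arm):
    """Aloha: a tuple of length 2 holding the left and right per-arm tuples."""
    return (left_arm, right_arm)


# Example: Panda, no motion, gripper fully open.
panda = ee_arm_action([0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.04, 0.04])

# Example: Aloha, left gripper open, right gripper closed.
aloha = aloha_ee_action(
    ee_arm_action([0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.05, 0.05]),
    ee_arm_action([0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0]),
)
```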