Communication Protocol
As introduced in the Quick Start using `fake_port`, the platform uses socket-based communication to exchange structured data between the simulation controller and the inference module. You can extend `fake_port` by integrating your own model controller to conduct benchmarking.
Outgoing Data
At each simulation timestep, the following structured data is sent through the socket:
```python
data = {
    "camera_data": camera_data,
    "instruction": instruction,
    "joint_position_state": joint_position_state,
    "franka_hand_pose": franka_hand_pose,
    "franka_pose": franka_pose,
    "timestep": step,
    "reset": reset,
}
```
Field Definitions
| Field Name | Type | Description |
|---|---|---|
| `camera_data` | `dict` | Data collected from robot-mounted cameras. |
| `instruction` | `string` | Natural language instruction or task prompt for the agent. |
| `joint_position_state` | `np.ndarray` (shape: `(9,)`) | Current joint positions of the robot arm (e.g., Franka). |
| `franka_hand_pose` | `tuple(np.ndarray, np.ndarray)` | World-frame pose of the Franka hand: position `(3,)`, quaternion `(4,)` (scalar-first). |
| `franka_pose` | `tuple(np.ndarray, np.ndarray)` | World-frame pose of the Franka base. |
| `timestep` | `int` | Current timestep index during rollout. |
| `reset` | `bool` | Whether to reset the environment or the model. |
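For instance, `franka_hand_pose` unpacks into a position vector and a scalar-first quaternion. A minimal sketch of converting that quaternion into a rotation matrix with plain NumPy (the helper name and the identity-quaternion example values are illustrative, not part of the protocol):

```python
import numpy as np

def quat_to_rotation_matrix(q):
    """Convert a scalar-first quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize defensively
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

# Unpack the pose tuple: position (3,), scalar-first quaternion (4,).
# Here we use an identity rotation as a placeholder value.
position, quaternion = np.zeros(3), np.array([1.0, 0.0, 0.0, 0.0])
R = quat_to_rotation_matrix(quaternion)
```

Keeping the scalar-first convention in mind matters: some libraries (e.g., SciPy's `Rotation.from_quat`) default to scalar-last ordering.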
This data is received by the model or controller, which is expected to return a structured action message in response.
Camera Data Structure
Each `camera_data` dictionary entry includes the following fields per camera:
| Field Name | Type | Description |
|---|---|---|
| `p` | `np.ndarray` (shape: `(3,)`) | Camera position in world coordinates. |
| `q` | `np.ndarray` (shape: `(4,)`) | Camera orientation as a quaternion (scalar-first) in world coordinates. |
| `rgb` | `np.ndarray` | RGB image array. Resolution depends on configuration in `configs/cameras/`. |
| `depth` | `np.ndarray` | Depth image array. |
| `intrinsics_matrix` | `np.ndarray` (shape: `(3, 3)`) | Camera intrinsics matrix. |
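As an illustration, `depth` and `intrinsics_matrix` can be combined to back-project a pixel into camera-frame coordinates. This is a sketch assuming the standard pinhole model and metric depth; the helper name and the example intrinsics values are hypothetical:

```python
import numpy as np

def deproject_pixel(u, v, depth_value, K):
    """Back-project pixel (u, v) with depth z into 3D camera-frame coordinates
    using the pinhole model: X = (u - cx) * z / fx, Y = (v - cy) * z / fy."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth_value / fx
    y = (v - cy) * depth_value / fy
    return np.array([x, y, depth_value])

# Hypothetical intrinsics for a 640x480 camera
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
point = deproject_pixel(320.0, 240.0, 1.5, K)  # principal point maps to (0, 0, z)
```

The per-camera `p` and `q` fields can then be used to transform such camera-frame points into world coordinates.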
Incoming Data (Returned Action)
The returned data should be an action dictionary. Currently, only Franka robots are supported. The expected return is a 9-dimensional array representing joint control:
- First 7 elements: Joint angle deltas (in radians). If your model outputs absolute joint positions, you must convert them to deltas. Example:

  ```python
  import numpy as np

  last_joint_position = None
  while True:
      data = wait_message(receive_socket)
      if data["reset"]:
          last_joint_position = data["joint_position_state"]
      processed_data = process_data(data)
      action = model.inference(processed_data)
      delta_joint_position = np.zeros(9)  # last 2 entries carry the gripper command
      delta_joint_position[:7] = action[:7] - last_joint_position[:7]
      last_joint_position = action
      action = {"action": delta_joint_position}
      send_message(send_socket, action)
  ```

- Last 2 elements: Gripper width control. `[0.04, 0.04]` represents fully open, and `[0.0, 0.0]` represents fully closed.
Return in format:

```python
action = {"action": nine_dimensional_array}
```
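Putting the two parts together, a hypothetical helper that packs 7 joint deltas and a gripper-open flag into the expected 9-dimensional message might look like this (the function name and constants are illustrative; only the array layout and the 0.04/0.0 gripper widths come from the protocol above):

```python
import numpy as np

GRIPPER_OPEN = 0.04   # per-finger width for fully open (from the protocol above)
GRIPPER_CLOSED = 0.0  # fully closed

def build_action(joint_deltas, gripper_open):
    """Pack 7 joint-angle deltas (radians) and a gripper command into the
    9-dimensional action dictionary expected by the controller."""
    joint_deltas = np.asarray(joint_deltas, dtype=float)
    assert joint_deltas.shape == (7,), "expected 7 joint deltas"
    width = GRIPPER_OPEN if gripper_open else GRIPPER_CLOSED
    return {"action": np.concatenate([joint_deltas, [width, width]])}

msg = build_action(np.zeros(7), gripper_open=True)  # hold pose, open gripper
```

This keeps the joint and gripper halves of the array explicit, which helps avoid off-by-one mistakes when slicing the 9-dimensional vector.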