
Data Engine

Robots deployed in the real world generate massive amounts of sensor data, telemetry, and operational logs. The Data Engine collects, processes, and stores this data, then feeds it back into training pipelines to continuously improve robot capabilities.

Architecture Overview

The system consists of several interconnected components that work together to create a complete data pipeline:

1. Data Collection Layer

Control & State Server

The Control & State Server acts as the central hub for robot communication:

  • gRPC Control Interface: Receives control commands from applications
  • gRPC State Interface: Streams robot state information to monitoring systems
  • Prometheus Integration: Exports metrics to Prometheus server for real-time monitoring via Grafana Dashboard
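To make the Prometheus integration concrete, here is a minimal sketch of how the server might render gauges in Prometheus' text exposition format. The class and metric names (`RobotMetrics`, `joint_temperature_celsius`) are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch: render robot gauges in Prometheus' text
# exposition format. Names here are assumptions, not the real server.

class RobotMetrics:
    def __init__(self):
        self._gauges = {}  # metric name -> {sorted label tuple: value}

    def set_gauge(self, name, value, **labels):
        key = tuple(sorted(labels.items()))
        self._gauges.setdefault(name, {})[key] = value

    def expose(self):
        """Render all gauges as 'name{labels} value' lines."""
        lines = []
        for name, series in sorted(self._gauges.items()):
            for labels, value in series.items():
                label_str = ",".join(f'{k}="{v}"' for k, v in labels)
                lines.append(f"{name}{{{label_str}}} {value}")
        return "\n".join(lines)

metrics = RobotMetrics()
metrics.set_gauge("joint_temperature_celsius", 41.5, robot="r01", joint="elbow")
print(metrics.expose())
# joint_temperature_celsius{joint="elbow",robot="r01"} 41.5
```

A real deployment would serve this text over HTTP for the Prometheus server to scrape, with Grafana reading from Prometheus for dashboards.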

WebRTC SFU (Selective Forwarding Unit)

Handles real-time media streaming from robots:

  • Video Streaming: Captures video feeds from robot cameras via WebRTC
  • Audio Streaming: Captures audio data from robot microphones via WebRTC
  • Low Latency: Provides real-time media transmission for teleoperation and monitoring
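The defining behavior of an SFU is that it forwards each incoming media packet to every subscriber of that stream without transcoding it. A minimal sketch of that routing logic, with assumed names (`SelectiveForwardingUnit`, the stream id):

```python
# Sketch of the core SFU idea: forward each published packet, as-is,
# to all subscribers of that stream. Class/stream names are assumptions.

class SelectiveForwardingUnit:
    def __init__(self):
        self._subscribers = {}  # stream_id -> list of delivery callbacks

    def subscribe(self, stream_id, on_packet):
        self._subscribers.setdefault(stream_id, []).append(on_packet)

    def on_publish(self, stream_id, packet):
        # No re-encoding: the packet is delivered unchanged to each peer,
        # which is what keeps forwarding latency low.
        for deliver in self._subscribers.get(stream_id, []):
            deliver(packet)

sfu = SelectiveForwardingUnit()
received = []
sfu.subscribe("robot01/front_camera", received.append)
sfu.on_publish("robot01/front_camera", b"rtp-packet-0")
print(received)  # [b'rtp-packet-0']
```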

2. Data Processing Layer

Pre-processing Services

Raw data from robots undergoes preprocessing before storage:

  • Video Pre-processing: Handles video encoding, frame extraction, and format conversion
  • Audio Pre-processing: Processes audio streams for storage and analysis
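One common pre-processing step is frame extraction: sampling every Nth frame from a decoded video stream before storage. The sketch below shows only that sampling step; the sampling rate and function name are illustrative, and real services would also handle encoding and container formats.

```python
# Hedged sketch of video frame extraction: keep one frame out of every
# `every_nth`. The rate of 3 is an illustrative default.

def extract_frames(frames, every_nth=3):
    """Keep frames at indices 0, N, 2N, ... from a decoded stream."""
    return [f for i, f in enumerate(frames) if i % every_nth == 0]

frames = [f"frame_{i}" for i in range(10)]
print(extract_frames(frames))
# ['frame_0', 'frame_3', 'frame_6', 'frame_9']
```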

Data Recording Services

A centralized service that aggregates all data streams:

  • Receives processed video and audio data
  • Collects control and state information from the Control & State Server
  • Timestamps and synchronizes multi-modal data
  • Prepares data for storage in standardized formats
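The synchronization step above can be sketched as a timestamp-ordered merge of per-modality streams. The record shape `(timestamp, modality, payload)` is an assumption for illustration.

```python
import heapq

# Sketch of multi-modal synchronization: records from each modality are
# already timestamped and time-ordered; merge them into one stream.

def synchronize(*streams):
    """Merge sorted per-modality streams into one stream by timestamp."""
    return list(heapq.merge(*streams, key=lambda rec: rec[0]))

video = [(0.00, "video", "frame0"), (0.10, "video", "frame1")]
audio = [(0.02, "audio", "chunk0"), (0.08, "audio", "chunk1")]
state = [(0.05, "state", "joints0")]

merged = synchronize(video, audio, state)
print([rec[1] for rec in merged])
# ['video', 'audio', 'state', 'audio', 'video']
```

`heapq.merge` assumes each input stream is already sorted, which holds here because each modality is recorded in arrival order.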

3. Storage Layer

Data Chunks

Processed data is organized into chunks for efficient storage and retrieval:

  • Time-based segmentation of continuous data streams
  • Metadata tagging for easy querying
  • Optimized for both sequential access and random sampling
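Time-based segmentation can be sketched as bucketing timestamped records into fixed windows. The 10-second window and record shape are illustrative assumptions, not the engine's actual chunk size.

```python
# Sketch of time-based segmentation: cut a continuous stream of
# (timestamp, payload) records into fixed-length chunks.

def chunk_by_time(records, window_s=10.0):
    """Group records into consecutive windows of `window_s` seconds."""
    chunks = {}
    for ts, payload in records:
        chunk_id = int(ts // window_s)  # window index doubles as chunk id
        chunks.setdefault(chunk_id, []).append((ts, payload))
    return chunks

records = [(1.0, "a"), (9.5, "b"), (12.0, "c"), (25.0, "d")]
print(chunk_by_time(records))
# {0: [(1.0, 'a'), (9.5, 'b')], 1: [(12.0, 'c')], 2: [(25.0, 'd')]}
```

The chunk id derived from the window index is also a natural place to attach metadata tags for querying.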

Data Hooks

Extensible hook system for data pipeline integration:

  • Trigger downstream processing on new data arrival
  • Enable custom data transformations
  • Support for real-time and batch processing workflows
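A hook system of this kind is essentially a callback registry fired on new data arrival. A minimal sketch, with assumed names (`DataHooks`, the hook payloads):

```python
# Sketch of an extensible hook registry: downstream processors register
# callbacks that fire when a new chunk lands. All names are assumptions.

class DataHooks:
    def __init__(self):
        self._hooks = []

    def register(self, hook):
        self._hooks.append(hook)

    def on_new_chunk(self, chunk):
        # Each hook may transform or route the chunk; the results feed
        # real-time or batch jobs downstream.
        return [hook(chunk) for hook in self._hooks]

hooks = DataHooks()
hooks.register(lambda chunk: ("index", chunk["id"]))
hooks.register(lambda chunk: ("label-queue", chunk["id"]))
print(hooks.on_new_chunk({"id": "chunk-0042"}))
# [('index', 'chunk-0042'), ('label-queue', 'chunk-0042')]
```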

Storage Bucket (Document Store)

Cloud-based object storage for long-term data persistence:

  • Scalable storage for large volumes of robot data
  • Support for various data formats (video, audio, telemetry, logs)
  • Versioning and lifecycle management
  • Cost-effective archival storage

4. Data Blobs for Training

Data blobs serve as the interface between stored data and training pipelines:

  • Curated Datasets: Selected and labeled data ready for training
  • Efficient Access: Optimized data loading for training workloads
  • Version Control: Track dataset versions across training experiments
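One simple way to version a curated dataset is to hash a manifest of its chunk ids, so the version id changes exactly when the contents change. This content-addressed scheme is a sketch under assumptions, not the engine's documented mechanism.

```python
import hashlib
import json

# Sketch of dataset version control: the sorted chunk list is the
# manifest, and its content hash is the dataset version id.

def dataset_version(chunk_ids):
    """Derive a stable version id from the set of chunks in a blob."""
    manifest = json.dumps(sorted(chunk_ids)).encode()
    return hashlib.sha256(manifest).hexdigest()[:12]

v1 = dataset_version(["chunk-001", "chunk-002"])
v2 = dataset_version(["chunk-002", "chunk-001"])  # same contents, any order
v3 = dataset_version(["chunk-001", "chunk-003"])  # different contents
print(v1 == v2, v1 == v3)  # True False
```

Because the id is derived from content, a training experiment that records its dataset version can later be reproduced against exactly the same chunks.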

5. Model Training & Evaluation Pipeline

Vision-Language-Action Model Training Pipeline

The core training infrastructure for robotic foundation models:

  • Multi-modal Learning: Trains on vision, language, and action data simultaneously
  • Distributed Training: Scales across multiple GPUs/TPUs
  • Experiment Tracking: Logs metrics, hyperparameters, and model artifacts
  • Data Blobs Integration: Directly consumes curated data from storage
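On the data-loading side, records pulled from a blob are grouped into fixed-size multi-modal batches. The sketch below shows only that batching step; the field names (`vision`, `language`, `action`) and drop-last behavior are illustrative assumptions.

```python
# Hedged sketch of batch assembly for multi-modal training: group
# synchronized records into fixed-size batches, dropping a trailing
# partial batch (a common, but not universal, choice).

def batches(records, batch_size):
    """Yield consecutive lists of exactly `batch_size` records."""
    for i in range(0, len(records) - batch_size + 1, batch_size):
        yield records[i:i + batch_size]

records = [
    {"vision": f"frame_{i}", "language": "pick up the cup", "action": [0.1 * i]}
    for i in range(5)
]
for batch in batches(records, batch_size=2):
    print([r["vision"] for r in batch])
# ['frame_0', 'frame_1'] then ['frame_2', 'frame_3']; frame_4 is dropped
```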

Model Checkpoints

Persistent storage for trained model weights:

  • Regular checkpoint saving during training
  • Best model selection based on validation metrics
  • Model versioning for reproducibility
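Best-model selection can be sketched as tracking, alongside the checkpoint history, whichever checkpoint scored highest on the validation metric. Higher-is-better is an assumption here; flip the comparison for a loss.

```python
# Sketch of checkpoint tracking with best-model selection. Assumes the
# validation metric is higher-is-better (e.g. task success rate).

class CheckpointTracker:
    def __init__(self):
        self.checkpoints = []  # (step, metric) history of every save
        self.best = None       # (step, metric) of the best checkpoint

    def save(self, step, val_metric):
        self.checkpoints.append((step, val_metric))
        if self.best is None or val_metric > self.best[1]:
            self.best = (step, val_metric)

tracker = CheckpointTracker()
for step, metric in [(1000, 0.61), (2000, 0.74), (3000, 0.70)]:
    tracker.save(step, metric)
print(tracker.best)  # (2000, 0.74)
```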

Sim2Real Pipeline

Bridges the gap between simulation and real-world deployment:

  • Test Deployment: Validates models in simulation before real-world testing
  • Model Evaluation in Simulation: Assesses model performance in controlled environments
  • Iterative Refinement: Feeds evaluation results back into training

6. Deployment & Evaluation Loop

Model Evaluation in Real

Real-world testing of trained models:

  • Over-the-air Updates: Deploys models to physical robots via Model Version Control
  • Performance Monitoring: Tracks success rates, failure modes, and edge cases
  • Data Collection: Gathers new data from real-world deployments
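Performance monitoring of the kind described above can be sketched as counting rollout outcomes per task and deriving success rates; the non-success labels double as a tally of failure modes. The outcome labels and class name are illustrative.

```python
from collections import Counter

# Sketch of real-world evaluation monitoring: count outcomes per task,
# report success rates, and keep failure-mode counts for triage.

class EvalMonitor:
    def __init__(self):
        self.outcomes = Counter()  # (task, outcome) -> count

    def record(self, task, outcome):
        self.outcomes[(task, outcome)] += 1

    def success_rate(self, task):
        total = sum(n for (t, _), n in self.outcomes.items() if t == task)
        wins = self.outcomes[(task, "success")]
        return wins / total if total else 0.0

monitor = EvalMonitor()
for outcome in ["success", "success", "grasp_slip", "success"]:
    monitor.record("pick_cup", outcome)
print(monitor.success_rate("pick_cup"))  # 0.75
```

Rollouts tagged with failure modes like the `grasp_slip` above are exactly the cases worth routing back into the data pipeline for re-training.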

Continuous Improvement Loop

The system creates a flywheel effect for model improvement:

  1. Real-world Evaluation → Identifies failure cases and collects new data
  2. Re-training → Incorporates new data into training pipeline
  3. Simulation Testing → Validates improvements in controlled environment
  4. Deployment → Pushes improved models back to robots