
Data Engine

Robots deployed in the real world generate massive amounts of sensor data, telemetry, and operational logs. The Data Engine collects, processes, and stores this data, then feeds it back into training pipelines to continuously improve robot capabilities.

Architecture Overview

The system consists of several interconnected components that work together to create a complete data pipeline:

1. Data Collection Layer

Control & State Server

The Control & State Server acts as the central hub for robot communication:

  • gRPC Control Interface: Receives control commands from applications
  • gRPC State Interface: Streams robot state information to monitoring systems
  • Prometheus Integration: Exports metrics to Prometheus server for real-time monitoring via Grafana Dashboard
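To make the Prometheus integration concrete, here is a minimal sketch of how the server might render gauges in Prometheus' text exposition format. The class and metric names (`RobotMetrics`, `joint_temperature_celsius`) are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch: render robot gauges in Prometheus' text
# exposition format. Names here are assumptions, not the real server.

class RobotMetrics:
    def __init__(self):
        self._gauges = {}  # metric name -> {sorted label tuple: value}

    def set_gauge(self, name, value, **labels):
        key = tuple(sorted(labels.items()))
        self._gauges.setdefault(name, {})[key] = value

    def expose(self):
        """Render all gauges as 'name{labels} value' lines."""
        lines = []
        for name, series in sorted(self._gauges.items()):
            for labels, value in series.items():
                label_str = ",".join(f'{k}="{v}"' for k, v in labels)
                lines.append(f"{name}{{{label_str}}} {value}")
        return "\n".join(lines)

metrics = RobotMetrics()
metrics.set_gauge("joint_temperature_celsius", 41.5, robot="r01", joint="elbow")
print(metrics.expose())
# joint_temperature_celsius{joint="elbow",robot="r01"} 41.5
```

A real deployment would serve this text over HTTP for the Prometheus server to scrape, with Grafana reading from Prometheus for dashboards.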

WebRTC SFU (Selective Forwarding Unit)

Handles real-time media streaming from robots:

  • Video Streaming: Captures video feeds from robot cameras via WebRTC
  • Audio Streaming: Captures audio data from robot microphones via WebRTC
  • Low Latency: Provides real-time media transmission for teleoperation and monitoring
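The defining behavior of an SFU is that it forwards each incoming media packet to every subscriber of that stream without transcoding it. A minimal sketch of that routing logic, with assumed names (`SelectiveForwardingUnit`, the stream id):

```python
# Sketch of the core SFU idea: forward each published packet, as-is,
# to all subscribers of that stream. Class/stream names are assumptions.

class SelectiveForwardingUnit:
    def __init__(self):
        self._subscribers = {}  # stream_id -> list of delivery callbacks

    def subscribe(self, stream_id, on_packet):
        self._subscribers.setdefault(stream_id, []).append(on_packet)

    def on_publish(self, stream_id, packet):
        # No re-encoding: the packet is delivered unchanged to each peer,
        # which is what keeps forwarding latency low.
        for deliver in self._subscribers.get(stream_id, []):
            deliver(packet)

sfu = SelectiveForwardingUnit()
received = []
sfu.subscribe("robot01/front_camera", received.append)
sfu.on_publish("robot01/front_camera", b"rtp-packet-0")
print(received)  # [b'rtp-packet-0']
```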

2. Data Processing Layer

Pre-processing Services

Raw data from robots undergoes preprocessing before storage:

  • Video Pre-processing: Handles video encoding, frame extraction, and format conversion
  • Audio Pre-processing: Processes audio streams for storage and analysis
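One common pre-processing step is frame extraction: sampling every Nth frame from a decoded video stream before storage. The sketch below shows only that sampling step; the sampling rate and function name are illustrative, and real services would also handle encoding and container formats.

```python
# Hedged sketch of video frame extraction: keep one frame out of every
# `every_nth`. The rate of 3 is an illustrative default.

def extract_frames(frames, every_nth=3):
    """Keep frames at indices 0, N, 2N, ... from a decoded stream."""
    return [f for i, f in enumerate(frames) if i % every_nth == 0]

frames = [f"frame_{i}" for i in range(10)]
print(extract_frames(frames))
# ['frame_0', 'frame_3', 'frame_6', 'frame_9']
```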

Data Recording Services

A centralized service that aggregates all data streams:

  • Receives processed video and audio data
  • Collects control and state information from the Control & State Server
  • Timestamps and synchronizes multi-modal data
  • Prepares data for storage in standardized formats
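The synchronization step above can be sketched as a timestamp-ordered merge of per-modality streams. The record shape `(timestamp, modality, payload)` is an assumption for illustration.

```python
import heapq

# Sketch of multi-modal synchronization: records from each modality are
# already timestamped and time-ordered; merge them into one stream.

def synchronize(*streams):
    """Merge sorted per-modality streams into one stream by timestamp."""
    return list(heapq.merge(*streams, key=lambda rec: rec[0]))

video = [(0.00, "video", "frame0"), (0.10, "video", "frame1")]
audio = [(0.02, "audio", "chunk0"), (0.08, "audio", "chunk1")]
state = [(0.05, "state", "joints0")]

merged = synchronize(video, audio, state)
print([rec[1] for rec in merged])
# ['video', 'audio', 'state', 'audio', 'video']
```

`heapq.merge` assumes each input stream is already sorted, which holds here because each modality is recorded in arrival order.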

3. Storage Layer

Data Chunks

Processed data is organized into chunks for efficient storage and retrieval:

  • Time-based segmentation of continuous data streams
  • Metadata tagging for easy querying
  • Optimized for both sequential access and random sampling
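Time-based segmentation can be sketched as bucketing timestamped records into fixed windows. The 10-second window and record shape are illustrative assumptions, not the engine's actual chunk size.

```python
# Sketch of time-based segmentation: cut a continuous stream of
# (timestamp, payload) records into fixed-length chunks.

def chunk_by_time(records, window_s=10.0):
    """Group records into consecutive windows of `window_s` seconds."""
    chunks = {}
    for ts, payload in records:
        chunk_id = int(ts // window_s)  # window index doubles as chunk id
        chunks.setdefault(chunk_id, []).append((ts, payload))
    return chunks

records = [(1.0, "a"), (9.5, "b"), (12.0, "c"), (25.0, "d")]
print(chunk_by_time(records))
# {0: [(1.0, 'a'), (9.5, 'b')], 1: [(12.0, 'c')], 2: [(25.0, 'd')]}
```

The chunk id derived from the window index is also a natural place to attach metadata tags for querying.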

Data Hooks

Extensible hook system for data pipeline integration:

  • Trigger downstream processing on new data arrival
  • Enable custom data transformations
  • Support for real-time and batch processing workflows
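A hook system of this kind is essentially a callback registry fired on new data arrival. A minimal sketch, with assumed names (`DataHooks`, the hook payloads):

```python
# Sketch of an extensible hook registry: downstream processors register
# callbacks that fire when a new chunk lands. All names are assumptions.

class DataHooks:
    def __init__(self):
        self._hooks = []

    def register(self, hook):
        self._hooks.append(hook)

    def on_new_chunk(self, chunk):
        # Each hook may transform or route the chunk; the results feed
        # real-time or batch jobs downstream.
        return [hook(chunk) for hook in self._hooks]

hooks = DataHooks()
hooks.register(lambda chunk: ("index", chunk["id"]))
hooks.register(lambda chunk: ("label-queue", chunk["id"]))
print(hooks.on_new_chunk({"id": "chunk-0042"}))
# [('index', 'chunk-0042'), ('label-queue', 'chunk-0042')]
```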

Storage Bucket (Document Store)

Cloud-based object storage for long-term data persistence:

  • Scalable storage for large volumes of robot data
  • Support for various data formats (video, audio, telemetry, logs)
  • Versioning and lifecycle management
  • Cost-effective archival storage

4. Data Blobs for Training

Data blobs serve as the interface between stored data and training pipelines:

  • Curated Datasets: Selected and labeled data ready for training
  • Efficient Access: Optimized data loading for training workloads
  • Version Control: Track dataset versions across training experiments
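One simple way to version a curated dataset is to hash a manifest of its chunk ids, so the version id changes exactly when the contents change. This content-addressed scheme is a sketch under assumptions, not the engine's documented mechanism.

```python
import hashlib
import json

# Sketch of dataset version control: the sorted chunk list is the
# manifest, and its content hash is the dataset version id.

def dataset_version(chunk_ids):
    """Derive a stable version id from the set of chunks in a blob."""
    manifest = json.dumps(sorted(chunk_ids)).encode()
    return hashlib.sha256(manifest).hexdigest()[:12]

v1 = dataset_version(["chunk-001", "chunk-002"])
v2 = dataset_version(["chunk-002", "chunk-001"])  # same contents, any order
v3 = dataset_version(["chunk-001", "chunk-003"])  # different contents
print(v1 == v2, v1 == v3)  # True False
```

Because the id is derived from content, a training experiment that records its dataset version can later be reproduced against exactly the same chunks.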

5. Model Training & Evaluation Pipeline

Vision-Language-Action Model Training Pipeline

The core training infrastructure for robotic foundation models:

  • Multi-modal Learning: Trains on vision, language, and action data simultaneously
  • Distributed Training: Scales across multiple GPUs/TPUs
  • Experiment Tracking: Logs metrics, hyperparameters, and model artifacts
  • Data Blobs Integration: Directly consumes curated data from storage
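On the data-loading side, records pulled from a blob are grouped into fixed-size multi-modal batches. The sketch below shows only that batching step; the field names (`vision`, `language`, `action`) and drop-last behavior are illustrative assumptions.

```python
# Hedged sketch of batch assembly for multi-modal training: group
# synchronized records into fixed-size batches, dropping a trailing
# partial batch (a common, but not universal, choice).

def batches(records, batch_size):
    """Yield consecutive lists of exactly `batch_size` records."""
    for i in range(0, len(records) - batch_size + 1, batch_size):
        yield records[i:i + batch_size]

records = [
    {"vision": f"frame_{i}", "language": "pick up the cup", "action": [0.1 * i]}
    for i in range(5)
]
for batch in batches(records, batch_size=2):
    print([r["vision"] for r in batch])
# ['frame_0', 'frame_1'] then ['frame_2', 'frame_3']; frame_4 is dropped
```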

Model Checkpoints

Persistent storage for trained model weights:

  • Regular checkpoint saving during training
  • Best model selection based on validation metrics
  • Model versioning for reproducibility
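Best-model selection can be sketched as tracking, alongside the checkpoint history, whichever checkpoint scored highest on the validation metric. Higher-is-better is an assumption here; flip the comparison for a loss.

```python
# Sketch of checkpoint tracking with best-model selection. Assumes the
# validation metric is higher-is-better (e.g. task success rate).

class CheckpointTracker:
    def __init__(self):
        self.checkpoints = []  # (step, metric) history of every save
        self.best = None       # (step, metric) of the best checkpoint

    def save(self, step, val_metric):
        self.checkpoints.append((step, val_metric))
        if self.best is None or val_metric > self.best[1]:
            self.best = (step, val_metric)

tracker = CheckpointTracker()
for step, metric in [(1000, 0.61), (2000, 0.74), (3000, 0.70)]:
    tracker.save(step, metric)
print(tracker.best)  # (2000, 0.74)
```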

Sim2Real Pipeline

Bridges the gap between simulation and real-world deployment:

  • Test Deployment: Validates models in simulation before real-world testing
  • Model Evaluation in Simulation: Assesses model performance in controlled environments
  • Iterative Refinement: Feeds evaluation results back into training

6. Deployment & Evaluation Loop

Model Evaluation in Real

Real-world testing of trained models:

  • Over-the-air Updates: Deploys models to physical robots via Model Version Control
  • Performance Monitoring: Tracks success rates, failure modes, and edge cases
  • Data Collection: Gathers new data from real-world deployments
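Performance monitoring of the kind described above can be sketched as counting rollout outcomes per task and deriving success rates; the non-success labels double as a tally of failure modes. The outcome labels and class name are illustrative.

```python
from collections import Counter

# Sketch of real-world evaluation monitoring: count outcomes per task,
# report success rates, and keep failure-mode counts for triage.

class EvalMonitor:
    def __init__(self):
        self.outcomes = Counter()  # (task, outcome) -> count

    def record(self, task, outcome):
        self.outcomes[(task, outcome)] += 1

    def success_rate(self, task):
        total = sum(n for (t, _), n in self.outcomes.items() if t == task)
        wins = self.outcomes[(task, "success")]
        return wins / total if total else 0.0

monitor = EvalMonitor()
for outcome in ["success", "success", "grasp_slip", "success"]:
    monitor.record("pick_cup", outcome)
print(monitor.success_rate("pick_cup"))  # 0.75
```

Rollouts tagged with failure modes like the `grasp_slip` above are exactly the cases worth routing back into the data pipeline for re-training.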

Continuous Improvement Loop

The system creates a flywheel effect for model improvement:

  1. Real-world Evaluation → Identifies failure cases and collects new data
  2. Re-training → Incorporates new data into training pipeline
  3. Simulation Testing → Validates improvements in controlled environment
  4. Deployment → Pushes improved models back to robots