Motivations & Background

In the XR field, visual odometry has long been the preferred way of estimating device locations. It's a computationally intensive task that has to be optimized for a tiny device that sits on your face.

Not only that, the device has to fully localize itself within a few milliseconds; latency must be kept to a minimum in an XR device.

Intel’s CEO decided to go on a side quest and make something for us (again): the RealSense T261/T265, decently accurate SLAM completely contained in a 22-gram package.

The T261/T265 has 3 sensors: 2 cameras and 1 IMU. Together with its custom silicon, this tiny module is able to output stable poses without requiring customized sensor placement.

The T261/T265 is widely used in robotics and is especially well known in the Project North Star series of headsets.

Before Intel discontinued all of this, they left us a treat that is extremely proprietary and hard to work with: https://www.intel.com/content/www/us/en/products/sku/125926/intel-movidius-myriad-x-vision-processing-unit-4gb/specifications.html

So we wondered: what if we combined the flexibility and the power of specialized computing to make something even better? We could offload the parallelizable algorithms (corner detection, non-maximum suppression, and other image convolution algorithms) onto the FPGA, while an upstream processor handles the Kalman filter or optimization algorithms!
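
To make "parallelizable" concrete, here is a minimal NumPy sketch of 3x3 non-maximum suppression over a corner-score map; every output pixel depends only on its immediate neighbourhood, which is exactly the kind of local operation that maps well onto FPGA line buffers. The function and the score-map input are illustrative only, not code from any existing implementation.

import numpy as np

def nms_3x3(score: np.ndarray) -> np.ndarray:
    # score: float array of per-pixel corner responses (e.g. FAST scores).
    # Keep a score only where it is the maximum of its 3x3 neighbourhood.
    h, w = score.shape
    padded = np.pad(score, 1, mode="constant", constant_values=-np.inf)
    # Stack the 9 shifted views so every pixel sees its full neighbourhood at once.
    neighbourhood = np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    )
    keep = score >= neighbourhood.max(axis=0)
    return np.where(keep, score, 0.0)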

Currently, @Vincent Xie’s Northstar uses 2 camera modules: the T261 for SLAM and the Ultraleap SIR170 (Rigel) for hand tracking in the near-IR range. Any additional function requires extra compute or camera modules. Not very convenient at all!

What if we combined these cameras into one unified, hackable, and flexible system?

Isaac is so camera rich…

@Anonymous is so camera rich…

Brainstorming

Since we wanted to guarantee that specific needs could be met, the hardware is built up with those needs in mind. The diagram below sketches the intended split of work between the camera system, the FPGA, and the host PC.

graph TD;
    subgraph FPGA["FPGA"]
        MEM --> FEAT["Feature Processing"]
        IMUDATA["IMU Data"] --> PREINT["IMU Pre-integration"]
        
        subgraph Feature_Pipeline["Parallel Feature Pipeline"]
            FEAT --> FD["FAST Corner Detection"]
            FEAT --> BRIEF["BRIEF Descriptors"]
            FEAT --> MATCH["Feature Matching"]
        end
        
        MEM["Framebuffer Memory"]
        USB["USB 3 PHY"]
    end
    
    subgraph Camera_System["Camera System"]
        CAM1["Camera 1"] --> MEM
        CAM2["Camera 2"] --> MEM
        IMU["IMU"] --> IMUDATA
    end

    subgraph PC["PC"]
        BA["Bundle Adjustment"]
        LCD["Loop Closure Detection"]
        ATLAS["Atlas Management"] 
        RELOC["Relocalization"]
    end

    MEM --> USB
    Feature_Pipeline --> USB
    FPGA --> USB
    PREINT --> USB
    USB --> |"Features, IMU, Video Stream"| PC
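
To make the FPGA-to-PC handoff concrete, here is a hypothetical sketch of how the host could unpack one feature record and one IMU sample from the USB stream. The field layout (pixel coordinates, corner response, and a 256-bit BRIEF descriptor per feature; timestamp, gyro, and accelerometer values per IMU sample) is purely an assumption for illustration, not a defined protocol.

import struct

# Hypothetical wire formats (assumptions, not a defined protocol):
#   feature record: u16 x, u16 y, f32 corner response, 32-byte BRIEF-256 descriptor
#   IMU sample:     u64 timestamp (us), 3x f32 gyro (rad/s), 3x f32 accel (m/s^2)
FEATURE_FMT = "<HHf32s"
IMU_FMT = "<Q3f3f"

def parse_feature(buf: bytes) -> dict:
    x, y, response, descriptor = struct.unpack(FEATURE_FMT, buf)
    return {"x": x, "y": y, "response": response, "descriptor": descriptor}

def parse_imu(buf: bytes) -> dict:
    fields = struct.unpack(IMU_FMT, buf)
    return {"t_us": fields[0], "gyro": fields[1:4], "accel": fields[4:7]}

On the PC side, records like these would feed the bundle adjustment, loop closure, and relocalization stages shown above.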

Requirements & Rationales

Hardware Requirement Rationales
We need at least 2 cameras on the final board.
We need hardware capable of low-latency processing of a large number of pixels.
We need an IMU on board to assist the vision-only algorithms.
We need high-speed output to a host device, allowing more flexibility in usage, from a simple stereo camera to running all the algorithms on board.
VITracker would need a frame buffer large enough to store at least 3 frames from each of at least 2 cameras (see the rough sizing sketch below).
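
A rough sizing sketch for that frame buffer, assuming 8-bit grayscale VGA (640x480) frames (an assumed working resolution; the final sensor and mode are not fixed here):

width, height, bytes_per_pixel = 640, 480, 1  # assumed 8-bit grayscale VGA frames
cameras, frames_buffered = 2, 3               # from the requirements above
buffer_bytes = width * height * bytes_per_pixel * cameras * frames_buffered
print(buffer_bytes)  # 1843200 bytes, i.e. roughly 1.8 MB of framebuffer memory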

Research & Notes

imx219 code reading notes

components: