Tech Lead/PM | Justin Lin & Nathan Reilly |
---|---|
GitHub | https://github.com/uwrealitylabs/universal-text-unity |
Scrum Board | https://www.notion.so/uwrl/1f9bc072402f8056b481d64fa56b4ef5?v=1f9bc072402f81b7a7ca000c4e8d9e22&pvs=4 |
Expected Delivery | July 2025 |
Changes to Spec:
Change Date | Change Author | Change Reason |
---|---|---|
Aug. 17, 2024 | Justin Lin | Initial Author |
Aug. 31, 2024 | Nathan Reilly & Justin Lin | Technical revisions for Text Label composition. Added Introduction |
Sep 27, 2024 | Nathan Reilly | Large-scale revision of the implementation |
Jan 20, 2025 | Nathan Reilly | Revision of the UTT and UTS implementation & other updates for W25 |
Point Persons:
Role | Name | Contact Info |
---|---|---|
Sedra Lead | Peter Teertstra | [email protected] |
Team Lead | Justin Lin | [email protected] |
| Nathan Reilly | [email protected] |
UW Reality Labs Leads | Vincent Xie | [email protected] |
| Kenny Na | [email protected] |
| Justin Lin | [email protected] |
Google Docs version of the tech spec here.
When you prompt a virtual assistant (for example, Meta AI on Ray-Ban glasses) and ask “What am I looking at?”, what actually happens? Currently, the pipeline seems rather simplistic: the cameras on the glasses take a picture, that picture is passed through a model that assigns a text caption to the image, and finally that single caption describing the whole image is passed into an LLM. This process, especially the step where one model must describe everything in an image using words, is often inaccurate.
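For illustration, a minimal sketch of that naive pipeline might look like the following. The captioning model choice, the `ask_llm` helper, and the `what_am_i_looking_at` function are assumptions made for this example only; they are not how Meta AI (or any particular assistant) is actually implemented.

```python
# Illustrative sketch of the naive pipeline described above:
# camera frame -> image-captioning model -> caption text -> LLM.
# The model choice and ask_llm() are placeholders, not a real assistant's internals.
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM backend the assistant calls."""
    raise NotImplementedError("Swap in a real LLM client here.")

def what_am_i_looking_at(image_path: str, question: str) -> str:
    # 1. The glasses' camera captures a single frame.
    frame = Image.open(image_path)
    # 2. A captioning model collapses the whole frame into one short text label.
    caption = captioner(frame)[0]["generated_text"]
    # 3. Only that lossy caption, not the image itself, ever reaches the LLM.
    return ask_llm(f"The user is looking at: {caption}. They ask: {question}")
```

The weakness is concentrated in step 2: any object or detail the captioner omits or mislabels is unrecoverable by the LLM downstream, which is why answers from this kind of pipeline are often inaccurate.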
What if we could build a system that…
If we created this, we could use it for…