From Idea to Innovation - Part 1

Today we’re starting a blog series where we show our process of taking an idea from inception through to a working field demo.

Technological maturity is often measured using Technology Readiness Levels (TRLs). We align to the DASA definitions, but in broad handfuls:

  • TRL 1-2: Idea and formulation (e.g. what if computers could talk to each other?)

  • TRL 3-4: Proof of concept demonstration (I can send a specific signal to the PC next to this one)

  • TRL 5-6: Field tested prototype (I can send arbitrary messages to a colleague in the next town)

  • TRL 7-8: Operationally tested and assured (People I don’t know are using it and have set up standards)

  • TRL 9: Used operationally (Anyone can use it!)

The step between TRL 5-6 and TRL 7-8 is often referred to as the ‘valley of death’, as that’s where good ideas die without sustained effort (and good luck).

Today we’ll discuss TRL 1-2 for my current idea: adding zero-shot classification capabilities to edge processing models (specifically MobileNet V2).

TRL 1: The idea.

Every good idea needs a spark. A flash of inspiration, an annoying problem, a golden opportunity – something that connects two thoughts into something new. Most ideas go nowhere, but by thinking them through we gain understanding we didn’t have before.

I’ve been fascinated with reasoning AIs since the release of CLIP in January 2021. My previous post discussed the amazing generative models enabled by this development, but even the original CLIP is incredible – we have essentially given vision AIs the ability to reason on the fly, and go beyond simply classifying an image as one of a fixed set of categories. Before, models needed examples of the things they could classify (e.g. lots of pictures of cats and dogs). CLIP introduced the ability to compare pictures to text to see how similar they are, allowing us to use the model for much more without having to explicitly train it.

CLIP zero-shot image classification outputs for the set of phrases “An open road”, “A busy road”, and “A blocked road”. Photos by Pixabay, Annie Spratt, and Joeseph Cooper.
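
To make that concrete, here’s a minimal sketch of CLIP zero-shot classification against the same three phrases. It assumes the Hugging Face transformers library, the openai/clip-vit-base-patch32 checkpoint, and a placeholder image file road.jpg; adapt these to your own setup.

```python
# Minimal sketch of CLIP zero-shot classification.
# Assumes Hugging Face transformers is installed; "road.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["An open road", "A busy road", "A blocked road"]
image = Image.open("road.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities; softmax turns
# them into relative confidences for each phrase.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
for label, p in zip(labels, probs):
    print(f"{label}: {p.item():.1%}")
```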

This sort of reasoning is critical to autonomous systems. Current modular ‘building block’ development approaches are good for discrete integration (e.g. this block can detect people, cars, and road signs) but lack the flexibility required for highly autonomous capability.

The real world is far too complex to model every eventuality, as shown by these examples of ‘Blocked’ road incidents from the VIDI dataset of YouTube videos.

Enter zero-shot models. Zero-shot models can be used in ways that were unforeseen at training time, which is critical for autonomous systems. We can’t predict every possible way a road could be blocked, for example, but a person can look at a road and judge whether it is passable without first asking ‘is there a barrier?’, ‘is there a tree?’, ‘is there a rock?’… A perfect zero-shot system would allow this reasoning to be done in a single step.

The problem is that zero-shot processing doesn’t currently work at the edge, which is exactly where autonomous systems need it. CLIP is too big to run in real time on common edge hardware. Wouldn’t it be good if instead we could run a common edge backbone (e.g. MobileNet V2) but still have zero-shot classification capability?

TRL 2: The general approach.

Once we have an idea, we need to find a way to express it. Pick your medium, whether code or clay, and work out how to represent your thoughts in it. You don’t need to know exactly what it will look like, but the general form should become apparent. From here we can begin to ask questions to help us refine the shape.

Image classification models can be conceptually understood as two steps: feature extraction and classification. In standard models, classification is often a simple fully connected layer applied to the extracted feature vector, followed by an argmax over the class confidences (the logits). CLIP instead compares the image feature vector to the text label feature vectors by measuring the cosine similarity between them.

Illustration of the ‘standard’ deep learning image classification process. This requires a classifier which has to be trained to output a set number of classes.
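
As a minimal sketch of that standard path, here’s roughly what it looks like with torchvision’s pretrained MobileNet V2 (the placeholder image road.jpg and the specific weights API are assumptions for illustration):

```python
# Minimal sketch of the 'standard' classification path: feature extraction,
# a fully connected layer, then argmax over a fixed set of classes.
# Assumes torchvision >= 0.13 and a placeholder image "road.jpg".
import torch
from PIL import Image
from torchvision import models

weights = models.MobileNet_V2_Weights.DEFAULT
model = models.mobilenet_v2(weights=weights).eval()
preprocess = weights.transforms()

image = preprocess(Image.open("road.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = model(image)                     # backbone features -> FC layer -> logits

class_idx = logits.argmax(dim=-1).item()      # pick the single most confident class
print(weights.meta["categories"][class_idx])  # limited to the 1000 ImageNet labels
```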

These feature vectors live in the latent space of the model. This is essentially the model’s conceptual space: how it represents the concepts it knows and the features that make them up. Each model’s latent space is unique to it; you can’t take a result from one model and interpret it with another. In fact, that’s the whole secret to CLIP’s success – its language and image models share a latent space because they were trained jointly on data that pairs images with text captions.

Illustration of the zero-shot deep learning image-text classification process. Here, the image can be compared against any text input to compute its similarity due to the shared latent space.
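
The comparison step itself is simple: normalise both embeddings and take a dot product. The sketch below uses random tensors as stand-ins for real CLIP image and text embeddings (the 512-dimensional size matches CLIP ViT-B/32, but is just an illustrative assumption):

```python
# Sketch of the zero-shot comparison: cosine similarity between one image
# embedding and several text-prompt embeddings in a shared latent space.
# Random tensors stand in for real CLIP outputs.
import torch
import torch.nn.functional as F

image_feat = torch.randn(512)      # placeholder for a CLIP image embedding
text_feats = torch.randn(3, 512)   # placeholders for three prompt embeddings

# After L2 normalisation, the dot product is exactly the cosine similarity.
image_feat = F.normalize(image_feat, dim=-1)
text_feats = F.normalize(text_feats, dim=-1)

similarities = text_feats @ image_feat   # one score per text prompt
probs = similarities.softmax(dim=-1)     # relative confidences across prompts
print(probs)
```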

But essentially, MobileNet and CLIP are doing the same thing – looking at a picture and describing its features. Surely we can translate between them somehow? It’s not impossible to take this post and turn it into French or Japanese, so can we find a process that translates MobileNet encodings into CLIP encodings? There will almost certainly be some loss in translation, the same way cultural context is lost when translating art, but can we get the general gist across and use MobileNet feature vectors with CLIP text features to perform zero-shot classification at the edge?
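
As a purely illustrative sketch of what such a translation could look like (not necessarily the approach we actually take): a small learned projection from MobileNet V2’s 1280-dimensional features into CLIP’s 512-dimensional latent space, trained so the projected features land near CLIP’s own image embeddings. The FeatureTranslator class below is hypothetical.

```python
# Hypothetical sketch of a 'translator' from MobileNet V2 features into CLIP's
# latent space. An illustration of the general idea only, not the method used
# in this series.
import torch
import torch.nn as nn

class FeatureTranslator(nn.Module):
    def __init__(self, mobilenet_dim: int = 1280, clip_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(mobilenet_dim, clip_dim)

    def forward(self, mobilenet_features: torch.Tensor) -> torch.Tensor:
        # Project edge-model features into the CLIP latent space and normalise,
        # so they can be compared to CLIP text embeddings by cosine similarity.
        clip_like = self.proj(mobilenet_features)
        return clip_like / clip_like.norm(dim=-1, keepdim=True)

# Training idea (sketch): over a set of unlabelled images, minimise the cosine
# distance between FeatureTranslator(mobilenet_features(image)) and the CLIP
# image encoder's embedding of the same image.
```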

We’ll discuss more next time, but in short: yes.

Zero-shot image classification using both CLIP and MobileNet V2. Photo by Dave Kim.
