From Idea To Innovation - Part 2

This is part of a series discussing how to turn an idea into reality. Part 1 can be found here.

Last time we left off with the idea of translating latent spaces to bring zero-shot classification capability to edge processing devices. Today we’ll explore how we do that at Technology Readiness Level (TRL) 3 by building a proof-of-concept model.

TRL 3: Proof of concept

Taking the core of an idea and making it exist is a critical step. This is the moment to crystallise the concept and see if there's anything worth pursuing.

TRL3 is one of my favourite steps (that and TRL6). It’s the first part where you can really go wrong (important if you’re trying to fail fast), so it’s the first true test of whether there is anything to your idea. Later stages get more in depth and integrated which adds value, but the clean simplicity of asking ‘can this work?’ is quite addictive.

So if we can fail this step, what is the test we're setting? In very simple terms, we have to ask: does a simple implementation of my idea seem to produce the expected results in near-ideal conditions? This is quite vague for a reason - we might not know exactly what to expect, or what ideal conditions are, and we definitely haven't got a good implementation yet.

So in order to answer that question, we need three things: an input, an implementation, and an expected output. Two of those three things are easy for us:

  • Input: A MobileNet (MN) feature vector (FV) for an image and some text strings to classify against.

  • Expected Output: Classification confidences similar to those produced by CLIP.

For our purposes, we can select any input image, choose some labels, and run it through MN for the input and CLIP to get the expected output. At this stage, that's good enough. Later we can think about measuring the difference between expected and actual outputs, and how sensitive it is to the inputs.

Diagram showing how we take an input image and text strings, convert them to MNv2 and CLIP feature vectors, then compare them to get the class confidences for each model.
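To make that concrete, here's a rough sketch of the pipeline in PyTorch, assuming the ViT-B/32 CLIP variant (512-dimensional FVs) and torchvision's MobileNetV2 (1280-dimensional features taken before the classifier); the image path and label strings are placeholders rather than the ones used here.

import torch
import clip
from PIL import Image
from torchvision import models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Expected output: CLIP's zero-shot confidences for our chosen labels.
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]  # placeholder labels
text_tokens = clip.tokenize(labels).to(device)

image = Image.open("example.jpg").convert("RGB")  # placeholder image
clip_image = clip_preprocess(image).unsqueeze(0).to(device)
with torch.no_grad():
    logits_per_image, _ = clip_model(clip_image, text_tokens)
    expected_probs = logits_per_image.softmax(dim=-1)

# Input: the MobileNetV2 feature vector for the same image (1280-d, pre-classifier).
mn = models.mobilenet_v2(pretrained=True).eval().to(device)
mn_preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
mn_image = mn_preprocess(image).unsqueeze(0).to(device)
with torch.no_grad():
    mn_fv = torch.nn.functional.adaptive_avg_pool2d(mn.features(mn_image), 1).flatten(1)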

But! We still need the actual implementation. At this stage, we need something that is simple, quick to create, but good enough to feasibly do what we want. You don’t want to spend forever on the perfect implementation only to find that it simply doesn’t work (i.e. fail slow), but equally testing an implementation that cannot possibly work is a waste of time and may falsely put you off the whole concept.

I find that 30 minutes to an hour is a good amount of time for a simple proof of concept - if you can’t do it in that time, you may need to simplify your idea or build some supporting tooling to make implementation easier. This tooling is important, as it will allow us to rapidly iterate and experiment. For me, that means software libraries (mine or open source) and datasets that I can pick up and reuse with ease. I don’t want to work out how to process and save video every project - I just pick up a known-good baseline and add my idea into it.

So - how are we going to translate MN feature vectors into the CLIP feature space? My first thought was a sort of autoencoder approach, so I put together a simple feed-forward PyTorch model to encode the MN features and then decode them to CLIP. I later realised this was overkill (as the MN feature vector is already an encoded representation of an image), and simplified it to just an encoder model.

A simple diagram showing the feed-forward layer sizes. All layers are Linear transforms with ReLU activation, apart from the final layer which uses Sigmoid activation instead.
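The diagram pins down the activations, but the exact hidden widths matter less at this stage, so here's a minimal sketch of that encoder; the hidden size below is my assumption, not necessarily what's in the diagram.

import torch.nn as nn

class MNToCLIP(nn.Module):
    """Feed-forward encoder from the MobileNetV2 feature space (1280-d)
    to the CLIP ViT-B/32 feature space (512-d). Hidden width is illustrative."""
    def __init__(self, in_dim=1280, out_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
            nn.Sigmoid(),  # final layer uses Sigmoid, so outputs live in [0, 1]
        )

    def forward(self, x):
        return self.net(x)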

For training, we again keep it simple - our loss metric will be the mean squared error between the predicted FV and the true CLIP FV. We could be clever and design a metric based on how we’re going to use it, but this is good enough for now. I’m using a sigmoid layer on the end for training simplicity, which does mean that I need to correctly scale my outputs when we’re using it.

A diagram showing how the image is passed into two existing encoders to extract their feature vectors, and then we convert one to the other and compare them to provide our training loss.
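As a sketch of that objective: because the sigmoid output lives in [0, 1], the CLIP targets need squashing into the same range during training (and un-squashing when we use the model). The particular scaling scheme and range below are illustrative assumptions, not the exact values used here.

import torch.nn.functional as F

# Assumed scaling: squash CLIP FV values from a fixed symmetric range into
# [0, 1] to match the sigmoid output, and invert it at inference time.
# CLIP_RANGE is a placeholder bound, not a measured value.
CLIP_RANGE = 10.0

def to_unit_range(clip_fv):
    return (clip_fv / CLIP_RANGE + 1.0) / 2.0    # [-CLIP_RANGE, CLIP_RANGE] -> [0, 1]

def from_unit_range(scaled_fv):
    return (scaled_fv * 2.0 - 1.0) * CLIP_RANGE  # inverse mapping for inference

def training_loss(encoder, mn_fv, clip_fv):
    """Mean squared error between the predicted FV and the scaled CLIP FV."""
    return F.mse_loss(encoder(mn_fv), to_unit_range(clip_fv))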

I picked COCO 2017 images as my training set, for no particular reason other than it's varied and I have it available. I trained on the Test set (40k images) and tested on the Val set (5k images); again, as it's a proof of concept we don't want to spend ages on each epoch, so the smaller sets are fine. I used standard, known-good optimiser settings and kicked it off. I ran it a couple of times, changing the learning rate and monitoring the loss curves, but not stringently measuring or testing.
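A minimal training loop over precomputed (MN FV, CLIP FV) pairs might look like the sketch below, reusing the training_loss helper above; the Adam defaults, batch size, and epoch count are stand-ins for "standard, known-good optimiser settings" rather than the exact configuration used.

import torch
from torch.utils.data import DataLoader, TensorDataset

def train(encoder, mn_fvs, clip_fvs, epochs=20, lr=1e-3, batch_size=256,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    # mn_fvs: (N, 1280) and clip_fvs: (N, 512) tensors precomputed for the COCO images.
    encoder = encoder.to(device)
    loader = DataLoader(TensorDataset(mn_fvs, clip_fvs), batch_size=batch_size, shuffle=True)
    optimiser = torch.optim.Adam(encoder.parameters(), lr=lr)  # default Adam settings
    for epoch in range(epochs):
        running = 0.0
        for mn_fv, clip_fv in loader:
            optimiser.zero_grad()
            loss = training_loss(encoder, mn_fv.to(device), clip_fv.to(device))
            loss.backward()
            optimiser.step()
            running += loss.item()
        print(f"epoch {epoch}: mean loss {running / len(loader):.5f}")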

Examples of images from the COCO 2017 Val set.

Now time to see if it works! We'll just adapt the CLIP example code to use our translated MN FV instead of the CLIP one.
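Building on the earlier sketches, the adapted classification step might look like the following. The logit scale of 100 follows CLIP's published zero-shot example, and the un-scaling step assumes the training-time squashing sketched above.

import torch

# Zero-shot classification with the translated MN feature vector standing in
# for CLIP's image features.
with torch.no_grad():
    text_features = clip_model.encode_text(text_tokens).float()
    text_features /= text_features.norm(dim=-1, keepdim=True)

    translated = from_unit_range(encoder(mn_fv))  # undo the sigmoid-range scaling
    translated /= translated.norm(dim=-1, keepdim=True)

    mn_probs = (100.0 * translated @ text_features.T).softmax(dim=-1)

print("CLIP confidences:      ", expected_probs.cpu().numpy())
print("Encoded MN confidences:", mn_probs.cpu().numpy())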

Zero shot classification performance using CLIP and encoded MobileNet v2 on an image from the COCO 2017 Val set.

On an image from the COCO Val set (our held-out set), we can see that the classification results are nearly identical, suggesting that our generalisation on the COCO set is ok.

Zero shot classification performance using CLIP and encoded MobileNet v2 on an image of New York City. [Credit to Jp Valery]

Looking at an image completely outside of COCO we again get comparable results, although MN is showing slightly less confidence than CLIP.

Zero shot classification performance using CLIP and encoded MobileNet v2 on an image generated by DALL-E.

And in an example where the results differ, the result is still semantically correct compared to the available classes.

These pictures show that we can achieve a zero-shot classification capability using MN that is similar in use to CLIP - our concept is proven to have some potential! Just to check that it's not a fluke, let's try simply using the first 512 elements of the MN FV to see if there's any correlation:

Zero shot classification performance using CLIP, encoded MobileNet v2, and raw MobileNet v2 [:512] on an image from the COCO 2017 Val set.

Nope, our encoder really does seem to have learned how to translate between latent spaces.

We can't really improve until we know how good our current model is, which means measuring performance. We can note ideas down and even code them up, but without measurements we can't be sure whether they make things better or worse. Next time, we'll look at how we can start to quantify performance and then experiment.
