From Idea to Innovation: Part 3

This is part of a series discussing how to turn an idea into reality. Part 2 can be found here. To recap: we’re trying to achieve zero-shot image classification using a MobileNet classifier to bring this capability to low-power edge computers.

Last time we concluded that we needed to start measuring performance in order to understand if our zero-shot MobileNet classifier is any good, and to know how to improve it. We need to be able to measure how good it is in both lab and field settings.

The question then becomes: how do we measure performance? In their blog post, OpenAI test CLIP against a number of image classification datasets. As we’re trying to translate MobileNet feature vectors into CLIP’s latent space, it would make sense to recreate this and see how our implementation fares. This provides a rough lab-based validation of our basic approach. We would expect to see MobileNet give similar performance trends to CLIP but worse absolute performance.
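To make the setup concrete, here’s a rough sketch of what that zero-shot evaluation loop looks like. The `mobilenet_to_clip` module is a placeholder name standing in for the translation model trained earlier in the series, and the OpenAI `clip` package plus torchvision are assumed; this is illustrative, not the exact code we ran.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# MobileNetV2 backbone used as a global-pooled feature extractor.
mobilenet = models.mobilenet_v2(weights="IMAGENET1K_V1").features.to(device).eval()

def class_text_embeddings(class_names):
    # Encode one "a photo of a {class}" prompt per class with CLIP's text tower.
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        emb = clip_model.encode_text(tokens).float()
    return F.normalize(emb, dim=-1)

def zero_shot_predict(images, class_embeddings, mobilenet_to_clip):
    # images: a preprocessed (B, 3, 224, 224) batch.
    with torch.no_grad():
        feats = mobilenet(images).mean(dim=(2, 3))                  # (B, 1280) pooled features
        translated = F.normalize(mobilenet_to_clip(feats), dim=-1)  # map into CLIP's latent space
        similarity = translated.float() @ class_embeddings.T        # cosine similarity per class
    return similarity.argmax(dim=-1)                                # predicted class index
```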

CLIP and ZeroShot-MobileNetv2 accuracy on a subset of the Caltech-101 dataset.

We can see that on a subset of Caltech-101 that’s exactly what we get. MobileNet’s performance isn’t great (especially compared to CLIP’s), but we would expect ~1% accuracy (1/101) from guessing at random, so we’re definitely outperforming that. We expect this as we’ve only trained a very simple model for a short amount of time; frankly, the performance is better than we deserve given how easy it’s been!

This gives a good indication of performance on a closed set of classes, but we want to use a zero-shot classifier to recognise real-world scenarios if we’re going to use it to help autonomous systems. Instead of matching a set number of classes, we can try to match captions to images. The nocaps dataset provides ~10 captions for each of 15k images from the Open Images Test and Validation datasets, giving a good range of imagery to work with. This is still an artificial test, but it is closer to our in-field use case and will be more indicative of real performance.

Examples of images and annotations in the nocaps dataset (image copied from their paper)

We can process this dataset in two ways:

  • As a multiclass classifier, where we rank k captions from the dataset against each image.

  • As a binary classifier, where we score a caption and its negative (i.e. “x” and “Not a x”) against the image. This can be done for both the actual caption and a randomly selected one. Both scoring setups are sketched below.
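A minimal sketch of the two scoring modes, assuming an image embedding that is already in CLIP space (from CLIP itself or from the translated MobileNet features) and a text-embedding helper like the one in the earlier snippet. Function and variable names are placeholders, and the negation phrasing is illustrative rather than the exact prompt we used.

```python
import torch.nn.functional as F

def multiclass_rank(image_emb, captions, embed_texts):
    # Mode 1: rank the true caption (index 0) against randomly drawn wrong captions.
    text_embs = F.normalize(embed_texts(captions), dim=-1)   # (k+1, d)
    scores = F.normalize(image_emb, dim=-1) @ text_embs.T    # cosine similarity per caption
    return scores.argsort(descending=True)                   # Top 1 correct if index 0 ranks first

def binary_match(image_emb, caption, embed_texts):
    # Mode 2: score a caption against its negative and keep whichever wins.
    text_embs = F.normalize(embed_texts([caption, f"Not a {caption}"]), dim=-1)
    scores = F.normalize(image_emb, dim=-1) @ text_embs.T
    return bool(scores[0] > scores[1])                        # True if the positive caption wins
```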

This gives us an insight into how the system might perform in real usage. Even better, nocaps provides in-domain, near-domain, and out-of-domain labels, meaning the captions describe only objects found in COCO, a mix of COCO and non-COCO objects, or only non-COCO objects, respectively. We can use these labels to assess whether MobileNet’s performance correlates with CLIP’s as we would expect, or if we need to investigate further. We used the 4,500-image validation set to start with.

CLIP and ZeroShot-MobileNetv2 accuracy on the nocaps validation dataset.

Performance-wise, MobileNet lags CLIP but is massively better than on Caltech-101. CLIP performs very well on the multiclass task (1 real caption + 30 random wrong captions), whilst its binary performance is a little worse but still good. MobileNet achieves decent Top 1 accuracy and good Top 5, indicating that it is extracting useful features but we perhaps aren’t aligning them as well as we could. Binary performance is more interesting - CLIP drops relative to its Top 1 performance whilst MobileNet slightly improves. We should explore this X / Not X behaviour in more detail.
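For reference, a quick sketch of how the Top 1 / Top 5 numbers can be computed from the multiclass similarity scores; `scores` and `targets` are assumed to have been collected from a loop like the one above, so this is illustrative rather than the exact evaluation code.

```python
import torch

def topk_accuracy(scores, targets, ks=(1, 5)):
    # scores: (N, k+1) similarity matrix; targets: (N,) index of the true caption.
    ranked = scores.argsort(dim=-1, descending=True)   # caption indices, best first
    hits = ranked == targets.unsqueeze(-1)             # True where the true caption sits
    return {k: hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
```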

CLIP and ZeroShot-MobileNetv2 Top 1 accuracy on the nocaps validation dataset, broken out by image domain.

Looking across domain types, we see a fairly flat response from both models for Top 1 accuracy. Whilst I was expecting more of a change from CLIP across the domains, I’m happy to see that MobileNet’s performance matches CLIP’s trend as expected.

Example nocaps labels chosen for this image by CLIP and ZeroShot MobileNet. CLIP consistently labelled the image correctly regardless of the label set, whilst MobileNet gets it wrong more often than not.

Delving a little further, we want to see what happens when MobileNet fails but CLIP succeeds. We can see from the example above that when repeatedly run on different label sets (k=5), CLIP consistently labels the image correctly whilst MobileNet mostly gets it wrong. Only one of these labels is anywhere close (“Two women in wheelchairs…”) whilst most are nonsensically far. Testing shows that at k = 20 we only achieve ~0.8% accuracy for this image - worse than chance. Since we can see that CLIP is capable of classifying this image, we can likely conclude that our feature translation is to blame here. Training on a wider dataset may improve performance. We can also see that there is a general correlation between encoding error and score difference, as expected. We don’t know how close we can get to exactly translating MobileNet to CLIP, but we can expect our performance to improve as we get nearer.
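The encoding error vs. score difference check can be sketched roughly as below, assuming we have matched CLIP image embeddings, translated MobileNet embeddings, and true-caption text embeddings for the same images (all L2-normalised); tensor names are placeholders for whatever the real pipeline produces.

```python
import torch

def error_vs_score_gap(clip_img, translated_img, caption_emb):
    # All inputs: (N, d) L2-normalised embeddings for the same N image/caption pairs.
    encoding_error = (clip_img - translated_img).norm(dim=-1)     # how far the translation lands from CLIP
    clip_score = (clip_img * caption_emb).sum(dim=-1)             # CLIP similarity to the true caption
    mobilenet_score = (translated_img * caption_emb).sum(dim=-1)  # translated-MobileNet similarity
    score_gap = clip_score - mobilenet_score
    # Pearson correlation between encoding error and the drop in score.
    return torch.corrcoef(torch.stack([encoding_error, score_gap]))[0, 1].item()
```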

With this level of assessment I’d be comfortable saying we’ve hit TRL 3 - “Analytical and experimental critical function and/or characteristic proof of concept”. We’ve shown that our core concept works and that we can measure its performance. Next up: improving the model and validating it on an edge computing system.
