this is aaronland

things I have written about elsewhere #20260109

Similar object images derived using the MobileCLIP computer-vision models

Negative: San Francisco International Airport (SFO), architectural model. Negative. Collection of SFO Museum, SFO Museum Collection. 2011.032.0817

This was originally published on the SFO Museum Mills Field weblog, in January 2026.

The following are SFO Museum Aviation Collection object images whose mathematical representations are similar to this image. These representations are referred to as vector embeddings and were derived using a series of different machine-learning models. The definition of "similar" in the eyes of any given machine-learning model is often opaque, typically mysterious and sometimes incorrect. As such these objects are included because they are generally similar in nature, or spirit, and as a way to help you discover things in our collection which you might have otherwise missed.

This is the preamble, or caveat (or warning), for a new feature we've added to the SFO Museum Aviation Collection website: similar object images derived using machine-learning computer-vision models. To do this we have created something called "vector embeddings" for every object image published on the collection website. Embeddings are multi-dimensional mathematical encodings of an image, generated by a computer-vision machine-learning model. These embeddings are then stored in a "vector" database: a specialized database optimized for indexing and comparing embeddings, for finding embeddings that are "near" one another, occupying some or all of the same area in a multi-dimensional space.

Postcard: United Air Lines, Douglas DC-3, radio dispatch. Paper, ink. Gift of Thomas G. Dragges, SFO Museum Collection. 2015.166.1691

Different models will create different embeddings for the same image, so you would be right to wonder what criteria any given model is using, and what data it was trained on, to determine those embeddings. That alone is reason enough for us to emphasize that "the definition of 'similar' in the eyes of any given machine-learning model is often opaque, typically mysterious and sometimes incorrect." But, like a lot of things involving machine learning, these image-similarity results, while not always right, aren't necessarily wrong either. In the same vein as searching collections by color, this "fuzzy" and imprecise space presents a whole new avenue for browsing collections and making visible objects that would otherwise get lost in the crowd.

Starting today, when you scroll down past the Description text on any given object page (on the SFO Museum Aviation Collection website) you'll see a new section called "Similar" objects. For example, this is what you'll see on the object page for this water pitcher from Pan American World Airways:

Water pitcher: Pan American World Airways. Stainless steel, wood. Gift in memory of Robert L. Stubbs, Pan Am Flight Engineer, SFO Museum Collection. 2011.135.003

Or this sunrise at SFO in 1977:

Slide: San Francisco International Airport (SFO), sunrise [digital image]. Digital file. Gift of Daniel Hahn, SFO Museum Collection. 2015.038.036

Or this PanAm flight crew having drinks in Buenos Aires in June, 1950:

Photograph: Pan American World Airways, Buenos Aires, Evelyn David. Photograph. Gift of Evelyn R. David, SFO Museum Collection. 2012.144.045

Or this fan from Japan Air Lines:

Fixed fan: Japan Air Lines. Bamboo, paper, ink. Gift of Thomas G. Dragges, SFO Museum Collection. 2002.035.186

Or this glass negative of a monoplane taking off at the 1915 Panama-Pacific International Exposition, in San Francisco:

Glass negative: Panama-Pacific International Exposition. Glass negative. Gift of Edwin I. Power, Jr. and Linda L. Liscom, SFO Museum Collection. 2010.282.025 a b

Or this poster of Swissair's fleet history:

Poster: Swissair, fleet history. Paper, ink. Gift of the William Hough Collection, SFO Museum Collection. 2006.010.431

Why is there one lonely safety information card included among all the other posters? Only the model knows...

I could do this all day! The rest of this blog post gets into some of the technical details of how we're doing this. If that's not of interest to you, this is a good place to stop reading and head over to the SFO Museum Aviation Collection website to enjoy seeing where all of these "similar" images will take you.

If you are not interested in the technical plumbing but are still curious to know which models we're using: as of this writing we are generating four sets of embeddings using Apple's MobileCLIP models. When we display "similar" images on an object page we are showing the unique set of images derived from all four models. If you want to see the complete results, and the scores assigned by a specific model, you can find them by appending /images to any given object web page URL. This will load a new web page listing all the images associated with that object. Each image will link to its own dedicated web page containing the catalog of "similar" images, the model(s) used to establish similarity, and the "distances" from the original image in question. For example, this is what you might see for the image associated with this American Airlines children's inflight activity book from 1997:

Children's inflight activity book: American Airlines. Paper, metal, ink. Gift of Thomas G. Dragges in memory of Robert May, SFO Museum Collection. 2001.109.263

How does it work?

Photograph: Pan American Airways System, Sikorsky S-42. Photograph. Gift of M.D. Klaas, SFO Museum Collection. 2018.112.0384

Under the hood we are using Apple's MobileCLIP models introduced with the MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training paper in 2024:

Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP – a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach, namely multi-modal reinforced training.

We have refactored Apple's example Swift code, originally published in the apple/ml-mobileclip repository, so that it can be used as standalone library code and in command-line applications. Our updates include a command-line tool (called embeddings) as well as a gRPC server (called embeddings-grpcd) which will emit vector embeddings for text and image input, from the command line or network requests respectively. That code is available from our GitHub account:

We have also released code, written in Go, for interacting with the Swift gRPC server:

For example, starting the Swift gRPC server:

$> bin/embeddings-grpcd --models=/path/to/models
2026-01-08T14:40:37-0800 info org.sfomuseum.embeddings.grpcd: [gRPCServer] listening for requests on 127.0.0.1:8080 (models:/path/to/Desktop/Models)

And then querying for text embeddings, using the Go client:

$> echo "hello world" | ./bin/embeddings -client-uri 'grpc://localhost:8080' text -
{"embeddings":[-0.3161621,-0.1697998,1.4482422,0.04 ... and so on

Or image embeddings:

$> ./bin/embeddings image -client-uri 'grpc://localhost:8080' test14.png 
{"embeddings":[-0.025161743,-0.027786255,-0.0014038086 ... and so on

That's all this software does: given an image (or a very short text string, which is a whole other story) it will generate vector embeddings with 512 dimensions using one of the four Apple MobileCLIP models. It doesn't store, index or query those embeddings. Those tasks are all left to separate processes. There are many different vector databases available but few of them produce the vectors they store. The swift-mobileclip package, and its associated tools, try to follow SFO Museum's broader goal of producing "small focused tools" for completing a specific task (generating embeddings) independent of its application or larger context.

We are able to create a signed and notarized embeddings (gRPC) server that can run with minimal installation requirements, both locally and on dedicated consumer-grade hardware, allowing it to be integrated into our existing workflows with little extra overhead. The embeddings-grpcd server will even run on older Intel-based Macs – it's very slow but it does work. We are generating embeddings for collection images offline and then, in a secondary process, iterating through each embedding looking for similar objects. Those results are then written to the sfomuseum-data-media-collection repository, where they are stored in the millsfield:similar property and look like this:

"millsfield:similar": {
      "apple/mobileclip_s0": [
        {
          "id": 1527825529,
          "similarity": 0.43360552
        },
        {
          "id": 1527826973,
          "similarity": 0.4882533
        },
	... and so on
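That secondary, offline pass might be sketched in Go like this. The toy identifiers and the use of Euclidean distance are assumptions for illustration (the exact metric isn't specified here), but the output mirrors the millsfield:similar shape above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"sort"
)

// Neighbor mirrors one entry in the millsfield:similar property.
type Neighbor struct {
	ID         int64   `json:"id"`
	Similarity float64 `json:"similarity"`
}

// euclidean returns the straight-line distance between two embeddings.
func euclidean(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// nearest returns the k closest embeddings to query, nearest first,
// excluding the query image itself.
func nearest(query int64, index map[int64][]float64, k int) []Neighbor {
	var results []Neighbor
	for id, emb := range index {
		if id == query {
			continue
		}
		results = append(results, Neighbor{ID: id, Similarity: euclidean(index[query], emb)})
	}
	sort.Slice(results, func(i, j int) bool { return results[i].Similarity < results[j].Similarity })
	if len(results) > k {
		results = results[:k]
	}
	return results
}

func main() {
	// Toy 3-dimensional index of image ID to embedding (hypothetical values).
	index := map[int64][]float64{
		1: {0.1, 0.9, 0.2},
		2: {0.2, 0.8, 0.1},
		3: {-0.9, 0.1, -0.3},
	}

	similar := map[string]interface{}{
		"millsfield:similar": map[string][]Neighbor{
			"apple/mobileclip_s0": nearest(1, index, 2),
		},
	}

	enc, _ := json.MarshalIndent(similar, "", "  ")
	fmt.Println(string(enc))
}
```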

This means we are not doing real-time image-similarity search on the SFO Museum Aviation Collection website. At the same time, setting aside the very real economic and environmental costs, it's not clear that we have a compelling use case for doing so at the moment. What we've got now takes a non-zero amount of time to rebuild, for example when new objects are published to the collections website, but it remains manageable and the results it produces yield demonstrable, immediate improvements for browsing our collection.

Photograph: Pan American World Airways, Atlantic Division. Photograph. Gift of William Craig, SFO Museum Collection. 2023.093.153 a b

There are a handful of open issues with the Swift code which will be addressed as time and circumstances permit, notably the use of Apple's newer MobileCLIP2 models. Some good progress has been made here, so hopefully we'll have an update soon, but we would welcome any suggestions or improvements to the work we've done to date.

Going forward, I would like to do some experiments using these tools to generate embeddings for other museum collections as a way to see how and where they might "hold hands" with the objects in our collection. I mentioned that we are writing the results of embeddings-based queries to the sfomuseum-data-media-collection data repository. We are not writing the embeddings themselves to those data files, if only because that would add a little under 2GB of additional data to an already oversized repository. All things being equal, I would like to publish the embeddings of our collection images for others to use. If nothing else it would save people the time and the "computrons" of regenerating those embeddings all over again. Some initial testing suggests that we could distribute them as a much smaller Parquet file, similar to how we already publish flight data at SFO as GeoParquet files. Stay tuned!