Indexing iCloud Photos with AI Using LLaVA and pgvector
Abstract
The article discusses the author's hobby project of leveraging a multi-modal large language model (LLM) to improve semantic search over their iCloud photo archive. The key points are covered in the Q&A below.
Q&A
[01] Indexing iCloud Photos with AI Using LLaVA and pgvector
1. What is the author's goal for this hobby project? The author's goal is to leverage a multi-modal LLM that can understand images to improve semantic search over their iCloud photo archive. They ask the LLM what it sees in each image, embed the response as a vector, and store it with pgvector so the photos can be searched by meaning.
2. Why did the author choose to use LLaVA as the LLM model? The author chose LLaVA (with Q4 quantization) because the llamafile build is a single self-contained executable that is trivial to run and exposes a REST API, which makes deployment extremely easy (a client sketch follows this Q&A list).
3. What prompts did the author try to get better descriptions from the LLM? The author tried several prompts, including:
- A conversational prompt between a user and an AI assistant
- A concise image summary request
- A detailed image analysis dialogue
- An interactive session for image analysis and description
4. Why did the author not want to use ChatGPT/GPT-4V for this project? The author wants to avoid relying on a single company (OpenAI) and instead support open-source LLMs. They believe the ability to run an LLM with vision on one's own computer is amazing, and they don't want to upload their entire photo album to a company with questionable privacy practices.
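The summary does not reproduce the author's client code, but a minimal sketch of prompting a local LLaVA llamafile could look like the following. It assumes the llamafile serves llama.cpp's HTTP API on localhost:8080 and accepts base64-encoded images through the /completion endpoint; the prompt text, file name, and generation parameters are illustrative rather than the author's.

```python
import base64
import requests

LLAVA_URL = "http://localhost:8080/completion"  # llamafile's default server port (assumption)

def describe_image(path: str) -> str:
    """Ask the locally running LLaVA server what it sees in one image."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        # Illustrative prompt; the author experimented with several variants.
        "prompt": "USER: [img-10]\nDescribe this photo in detail.\nASSISTANT:",
        "image_data": [{"data": image_b64, "id": 10}],
        "n_predict": 256,
        "temperature": 0.1,
    }
    response = requests.post(LLAVA_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["content"].strip()

if __name__ == "__main__":
    print(describe_image("IMG_0001.JPG"))
```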
[02] Embeddings and pgvector
1. How does the author store the image descriptions and embeddings? The author stores the image descriptions and embeddings in a Postgres database using the pgvector extension, which allows for easy vector similarity queries.
2. What are the steps in the author's Python code for generating descriptions, embeddings, and querying them? The steps, sketched in code after this list, are:
- Iterate through the files in the iCloud Photo library
- Prompt the local LLaVA model using the HTTP API to get a description for each image
- Encode the descriptions using a sentence transformer model (all-MiniLM-L6-v2)
- Store the filename, prompt, model, description, and embedding in the Postgres database
- For querying, use the same sentence transformer model to encode the query, and then use pgvector to find the closest matching image descriptions
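The article summary does not include the author's actual code, so the following is only a minimal sketch of that pipeline. It assumes psycopg 3, the pgvector Python package, a hypothetical photos table and database name, and reuses the describe_image helper from the earlier sketch.

```python
import glob

import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

conn = psycopg.connect("dbname=photos", autocommit=True)  # hypothetical DSN
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg pass numpy arrays to vector columns

conn.execute("""
    CREATE TABLE IF NOT EXISTS photos (
        id bigserial PRIMARY KEY,
        filename text,
        prompt text,
        model text,
        description text,
        embedding vector(384)
    )
""")

def index_photo(filename: str, prompt: str, model_name: str, description: str) -> None:
    """Embed a LLaVA description and store it alongside its metadata."""
    embedding = encoder.encode(description)
    conn.execute(
        "INSERT INTO photos (filename, prompt, model, description, embedding) "
        "VALUES (%s, %s, %s, %s, %s)",
        (filename, prompt, model_name, description, embedding),
    )

def index_library(photo_dir: str) -> None:
    """Describe every image in the library and index it."""
    prompt = "Describe this photo in detail."           # illustrative prompt
    for path in sorted(glob.glob(f"{photo_dir}/*.JPG")):
        description = describe_image(path)              # helper from the earlier sketch
        index_photo(path, prompt, "llava-v1.5-7b-q4", description)  # hypothetical model tag

def search(query: str, limit: int = 10) -> list[tuple[str, str]]:
    """Return the photos whose descriptions are semantically closest to the query."""
    query_embedding = encoder.encode(query)
    rows = conn.execute(
        "SELECT filename, description FROM photos "
        "ORDER BY embedding <=> %s LIMIT %s",            # <=> is pgvector's cosine distance
        (query_embedding, limit),
    )
    return rows.fetchall()
```

With this in place, a query such as search("black car") reduces to encoding the query with the same sentence transformer and letting pgvector order the stored descriptions by cosine distance.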
[03] Results and Future Improvements
1. How did the author evaluate the results of the image search? The author tried various queries, such as "black car", "white car", "rainy night", "transporting stuff", "toddler playing outside, sunny and grass", "colorful night" vs "dark night", and "network switch". The results were surprisingly good, with minimal hallucination, for the author's dataset of ~4,000 images.
2. What are some possible improvements the author suggests for the project? Possible improvements include:
- Incorporating metadata like location and date of the images
- Using face recognition and names of people
- Categorizing images with short prompts and clustering common themes
- Using multiple prompts to describe different aspects of the images (objects, scenery, emotion, colors, etc.)
- Re-ranking the search results using classic information retrieval techniques (a sketch follows this list)
- Testing with various model quantizations to improve speed
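None of these improvements are implemented in the article. As one illustration of the re-ranking idea, a hedged sketch could fetch a generous candidate set from pgvector and re-order it with a classic BM25 keyword score over the candidate descriptions, using the rank_bm25 package; the helper name, tokenization, and weighting here are assumptions, not the author's approach.

```python
from rank_bm25 import BM25Okapi

def rerank(query: str, candidates: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Re-order (filename, description) candidates from pgvector by BM25 keyword relevance."""
    tokenized_docs = [description.lower().split() for _, description in candidates]
    bm25 = BM25Okapi(tokenized_docs)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked]

# Usage: pull a wide net of candidates from the vector search, then re-rank by keyword overlap.
# results = rerank("black car", search("black car", limit=50))
```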