In this article, we’ll show how to use the CLIP model from OpenAI for Text-to-Image and Image-to-Image searching. We’ll also do a comparative analysis of the PyTorch model, FP16 OpenVINO format, and INT8 OpenVINO format in terms of speedup.
Here’s a summary of what’s covered:
- Using the PyTorch model
- Using OpenVINO conversion to speed up by 70%
- Using Quantization with OpenVINO NNCF to speed up by 400%
All results reported below are from a 13th Gen Intel(R) Core(TM) i5–13420H* using OpenVINO=2023.2 and NNCF=2.7.0 version.***
If you’d like to code along, here’s a Colab notebook with all the code you need to get started!
CLIP from OpenAI
CLIP (Contrastive Language–Image Pre-training) is a neural network capable of processing both images and text.
CLIP is a multimodal model, which means it can process both text and images. This capability allows it to embed different types of inputs in a shared multimodal space, where the positions of images and text have semantic meaning, regardless of their format.
The following image presents a visualization of the pre-training procedure

OpenVINO by Intel
OpenVINO toolkit is a free toolkit facilitating the optimization of a deep learning model from a framework and deploying an inference engine onto Intel hardware. We’ll use the FP16 and INT8 formats using the OpenVINO CLIP model. This post demonstrates how to use OpenVINO to accelerate an embedding pipeline in LanceDB.
Implementation
In the implementation section, we see the comparative implementation of the CLIP model from Hugging Face and OpenVINO formats, using the conceptual caption dataset. We start with the first step of loading the conceptual caption dataset from Hugging Face.
# https://huggingface.co/datasets/conceptual_captions
image_data = load_dataset(
"conceptual_captions", split="train",
)We will select a sample of 100 images from this large number of images
# taking first 100 images
image_data_df = pd.DataFrame(image_data[:100])Helper functions to validate image URLs and get images and captions from image URL
def check_valid_URLs(image_URL):
"""
Not all the URLs are valid. This function returns True if the URL is valid. False otherwise.
"""
try:
response = requests.get(image_URL)
Image.open(BytesIO(response.content))
return True
except Exception:
return False
def get_image(image_URL):
response = requests.get(image_URL)
image = Image.open(BytesIO(response.content)).convert("RGB")
return image
def get_image_caption(image_ID):
return image_data[image_ID]["caption"]
# Transform dataframe
image_data_df["is_valid"] = image_data_df["image_url"].apply(check_valid_URLs)
# removing all the invalid URLs
image_data_df = image_data_df[image_data_df["is_valid"] == True]
image_data_df.head()Now we have prepared the dataset and we are ready to start with CLIP using Hugging Face and OpenVINO and their performance comparative analysis in terms of speed.
PyTorch CLIP using Hugging Face
We’ll start with CLIP using Hugging Face and report the time taken to extract embeddings and search using LanceDB.
def get_model_info(model_ID, device):
"""
Loading CLIP from HuggingFace
"""
# Save the model to device
model = CLIPModel.from_pretrained(model_ID).to(device)
# Get the processor
processor = CLIPProcessor.from_pretrained(model_ID)
# Get the tokenizer
tokenizer = CLIPTokenizer.from_pretrained(model_ID)
# Return model, processor & tokenizer
return model, processor, tokenizer
# Set the device
device = "cuda" if torch.cuda.is_available() else "cpu"
model_ID = "openai/clip-vit-base-patch16"
model, processor, tokenizer = get_model_info(model_ID, device)Let’s write a helper function to extract text and image embeddings:
def get_single_text_embedding(text):
# Get single text embeddings
inputs = tokenizer(text, return_tensors="pt").to(device)
text_embeddings = model.get_text_features(**inputs)
# convert the embeddings to numpy array
embedding_as_np = text_embeddings.cpu().detach().numpy()
return embedding_as_np
def get_all_text_embeddings(df, text_col):
# Get all the text embeddings
df["text_embeddings"] = df[str(text_col)].apply(get_single_text_embedding)
return df
def get_single_image_embedding(my_image):
# Get single image embeddings
image = processor(text=None, images=my_image, return_tensors="pt")["pixel_values"].to(device)
embedding = model.get_image_features(image)
# convert the embeddings to numpy array
embedding_as_np = embedding.cpu().detach().numpy()
return embedding_as_np
def get_all_images_embedding(df, img_column):
# Get all image embeddings
df["img_embeddings"] = df[str(img_column)].apply(get_single_image_embedding)
return dfUse LanceDB for storing the embeddings & search
import lancedb
db = lancedb.connect("./.lancedb")Extracting Embeddings of 83 images using CLIP Hugging faces model and time taken to extract embeddings.
import time
# extracting embeddings using Hugging Face
start_time = time.time()
image_data_df = get_all_images_embedding(image_data_df, "image")
print(f"Time Taken to extract Embeddings of {len(image_data_df)} Images(in seconds): ", time.time()-start_time)This pipeline to extract embeddings of 83 images took 55.79 sec.
Data ingestion and creating embeddings in LanceDB
Next, we show how to create the embeddings and ingest them into LanceDB.
def create_and_ingest(image_data_df, table_name):
"""
Create and Ingest Extracted Embeddings using Hugging Face
"""
image_url = image_data_df.image_url.tolist()
image_embeddings = [arr.astype(np.float32).tolist() for arr in image_data_df.img_embeddings.tolist()]
data = []
for i in range(len(image_url)):
temp = {}
temp['vector'] = image_embeddings[i][0]
temp['image'] = image_url[i]
data.append(temp)
# Create a Table
tbl = db.create_table(name=table_name, data=data, mode="overwrite")
return tbl
# Create and ingest embeddings for pt model
pt_tbl = create_and_ingest(image_data_df, "pt_table")Query the embeddings
You can easily query the embeddings via similarity in LanceDB as follows:
# Get the image embedding and query for each caption
pt_img_results = {}
start_time = time.time()
for i in range(len(image_data_df)):
img_query = image_data_df.iloc[i].image
query_embedding = get_single_image_embedding(img_query).tolist()
# querying with image
result = pt_tbl.search(query_embedding[0]).limit(4).to_list()
pt_img_results[str(i)] = resultCLIP model using FP16 OpenVINO format
Next, we’ll show the results from the same pipeline with the CLIP F16 OpenVINO format.
import openvino as ov
# saving openvino model
model.config.torchscript = True
ov_model = ov.convert_model(model, example_input=dict(inputs))
ov.save_model(ov_model, 'clip-vit-base-patch16.xml')Compiling the CLIP OpenVINO model
import numpy as np
from scipy.special import softmax
from openvino.runtime import Core
"""
Compiling CLIP in Openvino FP16 format
"""
# create OpenVINO core object instance
core = Core()
# compile model for loading on device
compiled_model = core.compile_model(ov_model, device_name="AUTO", config={"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT"})
# obtain output tensor for getting predictionsExtracting the embeddings of 83 images using CLIP FP16 OpenVINO model now takes 31.79 seconds – this is a 43% reduction!
import time
# time taken to extract embeddings using CLIP OpenVINO format
start_time = time.time()
image_embeddings = extract_openvino_embeddings(image_data_df)
print(f"Time Taken to extract Embeddings of {len(image_data_df)} Images(in seconds): ", time.time()-start_time)The embeddings can be ingested to LanceDB the same as before:
def create_and_ingest_openvino(image_url, image_embeddings, table_name):
"""
Create and Ingest Extracted Embeddings using OpenVINO FP16 format
"""
data = []
for i in range(len(image_url)):
temp = {}
temp['vector'] = image_embeddings[i]
temp['image'] = image_url[i]
data.append(temp)
# Create a Table
tbl = db.create_table(name=table_name, data=data, mode="overwrite")
return tbl
# create and ingest embeddings for OpenVINO fp16 model
ov_tbl = create_and_ingest_openvino(image_url, image_embeddings, "ov_tbl")We query the embeddings and run search just like before:
# Get the image embedding and query for each caption
ov_img_results = {}
start_time = time.time()
for i in range(len(image_data_df)):
img_query = image_data_df.iloc[i].image
image = image_data_df.iloc[i].image
inputs = processor(images=[image], return_tensors="np", padding=True)
query_embedding = compiled_model(dict(inputs))["image_embeds"][0]
# querying with query image
result = ov_tbl.search(query_embedding).limit(4).to_list()
ov_img_results[str(i)] = resultNNCF INT 8-bit Quantization
You can also use 8-bit Post Training Optimization from NNCF (Neural Network Compression Framework) and run inference on the quantized model via OpenVINO Toolkit.
pip install -q datasets
pip install -q "nncf>=2.7.0"import os
from transformers import CLIPProcessor, CLIPModel
fp16_model_path = 'clip-vit-base-patch16.xml'
#inputs preparation for conversion and creating processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
max_length = model.config.text_config.max_position_embeddings
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")Here’s a helper function to convert into Int8 format using NNCF:
import requests
from io import BytesIO
import numpy as np
from PIL import Image
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
def check_text_data(data):
"""
Check if the given data is text-based.
"""
if isinstance(data, str):
return True
if isinstance(data, list):
return all(isinstance(x, str) for x in data)
return False
def get_pil_from_url(url):
"""
Downloads and converts an image from a URL to a PIL Image object.
"""
response = requests.get(url, verify=False, timeout=20)
image = Image.open(BytesIO(response.content))
return image.convert("RGB")
def collate_fn(example, image_column="image_url", text_column="caption"):
"""
Preprocesses an example by loading and transforming image and text data.
Checks if the text data in the example is valid by calling the `check_text_data` function.
Downloads the image specified by the URL in the image_column by calling the `get_pil_from_url` function.
If there is any error during the download process, returns None.
Returns the preprocessed inputs with transformed image and text data.
"""
assert len(example) == 1
example = example[0]
if not check_text_data(example[text_column]):
raise ValueError("Text data is not valid")
url = example[image_column]
try:
image = get_pil_from_url(url)
h, w = image.size
if h == 1 or w == 1:
return None
except Exception:
return None
#preparing inputs for processor
inputs = processor(text=example[text_column], images=[image], return_tensors="pt", padding=True)
if inputs['input_ids'].shape[1] > max_length:
return None
return inputsInitializing NNCF and Saving the Quantized Model
import logging
import nncf
from openvino.runtime import Core, serialize
core = Core()
# Initialize NNCF
nncf.set_log_level(logging.ERROR)
int8_model_path = 'clip-vit-base-patch16_int8.xml'
calibration_data = prepare_dataset()
ov_model = core.read_model(fp16_model_path)
if len(calibration_data) == 0:
raise RuntimeError(
'Calibration dataset is empty. Please check internet connection and try to download images manually.'
)
#Quantize CLIP fp16 model using NNCF
calibration_dataset = nncf.Dataset(calibration_data)
quantized_model = nncf.quantize(
model=ov_model,
calibration_dataset=calibration_dataset,
model_type=nncf.ModelType.TRANSFORMER,
)
#Saving Quantized model
serialize(quantized_model, int8_model_path)Compiling the INT8 model and Helper function for extracting features
import numpy as np
from scipy.special import softmax
from openvino.runtime import Core
# create OpenVINO core object instance
core = Core()
# compile model for loading on device
compiled_model = core.compile_model(quantized_model, device_name="AUTO", config={"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT"})
# obtain output tensor for getting predictions
image_embeds = compiled_model.output(0)
logits_per_image_out = compiled_model.output(0)
image_url = image_data_df.image_url.tolist()
image_embeddings = []
def extract_quantized_openvino_embeddings(image_data_df, compiled_model):
"""
Extract embeddings of Images using CLIP Quantized model
"""
for i in range(len(image_data_df)):
image = image_data_df.iloc[i].image
inputs = processor(images=[image], return_tensors="np", padding=True)
image_embeddings.append(compiled_model(dict(inputs))["image_embeds"][0])
return image_embeddings
import time
start_time = time.time()
# time taken to extract embeddings using CLIP OpenVINO format
image_embeddings = extract_quantized_openvino_embeddings(image_data_df, compiled_model)
print(f"Time Taken to extract Embeddings of {len(image_data_df)} Images(in seconds): ", time.time()-start_time)With the updated pipeline using CLIP OpenVINO format, the time taken to extract embeddings of 83 images is brought down to just 13.70 sec! That’s a 75.4% reduction from the original CLIP model!
We can ingest the embeddings into LanceDB as follows:
# create and ingest embeddings for OpenVINO int8 model format
qov_tbl = create_and_ingest_openvino(image_url, image_embeddings, "qov_tbl")# Get the image embedding and query for each caption
ov_img_results = {}
start_time = time.time()
for i in range(len(image_data_df)):
img_query = image_data_df.iloc[i].image
image = image_data_df.iloc[i].image
inputs = processor(images=[image], return_tensors="np", padding=True)
query_embedding = compiled_model(dict(inputs))["image_embeds"][0]
# querying with query image
result = qov_tbl.search(query_embedding).limit(4).to_list()
ov_img_results[str(i)] = resultWe’ve now shown the performance improvement using all the CLIP model formats PyTorch from Hugging Face, FP16 OpenVINO, and INT8 OpenVINO.
Conclusions
All these results are on CPU for comparison of the PyTorch model with the OpenVINO model formats(FP16/ INT8)
| Format | Time (s) |
|---|---|
| PyTorch model from Hugging Face | 55.26 |
| OpenVINO FP16 format | 31.79 |
| OpenVINO INT8 format | 13.70 |
The performance acceleration achieved with an FP16 model is 1.73 times the PyTorch model, which is a relatively modest (yet decent) increase in speed.
However, when switching to the INT8 OpenVINO format, there is a 4.03 times increase in speed compared to the PyTorch model.
Visit the LanceDB GitHub to learn more about how to work with vector search at scale, and for more such tutorials and demo applications, visit the vectordb-recipes repo. For the latest updates from LanceDB, follow our LinkedIn and X pages.