AMMICO/docs/source/notebooks/Example multimodal.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "22df2297-0629-45aa-b88c-6c61f1544db6",
   "metadata": {},
   "source": [
    "# Image Multimodal Search"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9eeeb302-296e-48dc-86c7-254aa02f2b3a",
   "metadata": {},
   "source": [
    "This notebook shows some preliminary work on multimodal image search with the LAVIS library. It is mainly meant to explore the library's capabilities and to decide on future research directions. We package our code into the `ammico` package, which is imported here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f10ad6c9-b1a0-4043-8c5d-ed660d77be37",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import ammico\n",
    "import ammico.multimodal_search as ms"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "acf08b44-3ea6-44cd-926d-15c0fd9f39e0",
   "metadata": {},
   "source": [
    "Set the path to the folder that contains the input images:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d3fe589-ff3c-4575-b8f5-650db85596bc",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "images = ammico.utils.find_files(\n",
    "    path=\"data/\",\n",
    "    limit=10,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "adf3db21-1f8b-4d44-bbef-ef0acf4623a0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "mydict = ammico.utils.initialize_dict(images)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "987540a8-d800-4c70-a76b-7bfabaf123fa",
   "metadata": {},
   "source": [
    "## Indexing and extracting features from images in selected folder"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66d6ede4-00bc-4aeb-9a36-e52d7de33fe5",
   "metadata": {},
   "source": [
    "You can choose one of the following models: `blip`, `blip2`, `albef`, `clip_base`, `clip_vitl14`, `clip_vitl14_336`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7bbca1f0-d4b0-43cd-8e05-ee39d37c328e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "model_type = \"blip\"\n",
    "# model_type = \"blip2\"\n",
    "# model_type = \"albef\"\n",
    "# model_type = \"clip_base\"\n",
    "# model_type = \"clip_vitl14\"\n",
    "# model_type = \"clip_vitl14_336\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca095404-57d0-4f5d-aeb0-38c232252b17",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "(\n",
    "    model,\n",
    "    vis_processors,\n",
    "    txt_processors,\n",
    "    image_keys,\n",
    "    image_names,\n",
    "    features_image_stacked,\n",
    ") = ms.MultimodalSearch.parsing_images(mydict, model_type, path_to_saved_tensors=\".\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ff8a894-566b-4c4f-acca-21c50b5b1f52",
   "metadata": {},
   "source": [
    "The image feature tensors `features_image_stacked` are saved in `<Number_of_images>_<model_name>_saved_features_image.pt`. Once you have run this step for a given model and set of images, you do not need to repeat it. Instead, you can load the saved features with the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56c6d488-f093-4661-835a-5c73a329c874",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# (\n",
    "#    model,\n",
    "#    vis_processors,\n",
    "#    txt_processors,\n",
    "#    image_keys,\n",
    "#    image_names,\n",
    "#    features_image_stacked,\n",
    "# ) = ms.MultimodalSearch.parsing_images(mydict, model_type,\"18_clip_base_saved_features_image.pt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "309923c1-d6f8-4424-8fca-bde5f3a98b38",
   "metadata": {},
   "source": [
    "Here we have already processed an image folder containing 18 images with the `clip_base` model. You therefore only need to pass the name of the saved file with the tensors of all images, `18_clip_base_saved_features_image.pt`, as the third argument to the previous function."
   ]
  },
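  {
   "cell_type": "markdown",
   "id": "3b7e2c1a-4d5f-4a6b-8c9d-1e2f3a4b5c6d",
   "metadata": {},
   "source": [
    "The file name pattern described above can be turned into a small check for whether saved features already exist. This is only a sketch under assumptions: the helper `saved_features_file` is hypothetical, it simply follows the `<Number_of_images>_<model_name>_saved_features_image.pt` pattern, and it does not call `ammico`.\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def saved_features_file(number_of_images, model_name, folder='.'):\n",
    "    # Hypothetical helper: build the expected file name from the pattern above.\n",
    "    return Path(folder) / f'{number_of_images}_{model_name}_saved_features_image.pt'\n",
    "\n",
    "\n",
    "candidate = saved_features_file(18, 'clip_base')\n",
    "print(candidate.name)  # 18_clip_base_saved_features_image.pt\n",
    "print(candidate.exists())  # True only if the features were saved earlier\n",
    "```\n",
    "\n",
    "If the file exists, its name can be passed to the loading command shown above instead of recomputing the features."
   ]
  },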
  {
   "cell_type": "markdown",
   "id": "162a52e8-6652-4897-b92e-645cab07aaef",
   "metadata": {},
   "source": [
    "Next, you need to form search queries. You can search either by image or by text, and you can submit a single query or several queries at once; the computation time should not differ much. The format of the queries is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4196a52-d01e-42e4-8674-5712f7d6f792",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "search_query3 = [\n",
    "    {\"text_input\": \"politician press conference\"},\n",
    "    {\"text_input\": \"a person wearing a mask\"},\n",
    "    {\"image\": \"data/106349S_por.png\"},\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bcf3127-3dfd-4ff4-b9e7-a043099b1418",
   "metadata": {},
   "source": [
    "You can filter your results in 3 different ways:\n",
    "- `filter_number_of_images` limits the number of images found. For example, if `filter_number_of_images = 10`, only the 10 images that best match the query are shown. The ranks of the other images are set to `None` and their similarity values to `0`.\n",
    "- `filter_val_limit` discards images whose similarity value is below the limit. For example, if `filter_val_limit = 0.2`, all images with a similarity less than 0.2 are discarded.\n",
    "- `filter_rel_error` (percentage) keeps only images for which `100 * abs(current_similarity_value - best_similarity_value_in_current_search) / best_similarity_value_in_current_search < filter_rel_error`. For example, if `filter_rel_error = 30` and the top-ranked image has a similarity value of 0.5, all images with a similarity less than 0.35 are discarded."
   ]
  },
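  {
   "cell_type": "markdown",
   "id": "7c4d9e2f-1a3b-4c5d-9e6f-2b3c4d5e6f7a",
   "metadata": {},
   "source": [
    "The three filter rules above can be sketched in plain Python on a toy list of similarity values. This only illustrates the arithmetic described in the list and is not the `ammico` implementation: the helper `apply_filters` and the sample values are hypothetical, and we assume that the filters are combined, that the list is non-empty with a positive best value, and that filtered-out images get rank `None` and similarity `0`.\n",
    "\n",
    "```python\n",
    "def apply_filters(similarities, number_of_images=None, val_limit=None, rel_error=None):\n",
    "    # Hypothetical helper: returns one (rank, similarity) pair per image, in input order.\n",
    "    order = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)\n",
    "    best = similarities[order[0]]\n",
    "    results = [(None, 0)] * len(similarities)\n",
    "    for rank, i in enumerate(order):\n",
    "        value = similarities[i]\n",
    "        # Keep only the best-matching images.\n",
    "        if number_of_images is not None and rank >= number_of_images:\n",
    "            continue\n",
    "        # Discard images below the absolute similarity limit.\n",
    "        if val_limit is not None and value < val_limit:\n",
    "            continue\n",
    "        # Discard images whose relative error (in percent) is too large.\n",
    "        if rel_error is not None and 100 * abs(value - best) / best >= rel_error:\n",
    "            continue\n",
    "        results[i] = (rank, value)\n",
    "    return results\n",
    "\n",
    "\n",
    "sims = [0.5, 0.1, 0.4, 0.3]\n",
    "# With rel_error=30 and a best value of 0.5, only values above 0.35 are kept.\n",
    "print(apply_filters(sims, rel_error=30))\n",
    "```\n",
    "\n",
    "In this toy example the images with similarity 0.5 and 0.4 keep ranks 0 and 1, while the other two are reset to `(None, 0)`."
   ]
  },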
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f7dc52f-7ee9-4590-96b7-e0d9d3b82378",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "similarity = ms.MultimodalSearch.multimodal_search(\n",
    "    mydict,\n",
    "    model,\n",
    "    vis_processors,\n",
    "    txt_processors,\n",
    "    model_type,\n",
    "    image_keys,\n",
    "    features_image_stacked,\n",
    "    search_query3,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1cf7e46-0c2c-4fb2-b89a-ef585ccb9339",
   "metadata": {},
   "source": [
    "After running the `multimodal_search` function, the results of each query are added to the source dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ad74b21-6187-4a58-9ed8-fd3e80f5a4ed",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "mydict[\"106349S_por\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd3ee120-8561-482b-a76a-e8f996783325",
   "metadata": {},
   "source": [
    "The `show_results` function presents the search results conveniently:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4324e4fd-e9aa-4933-bb12-074d54e0c510",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ms.MultimodalSearch.show_results(mydict, search_query3[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d86ab96b-1907-4b7f-a78e-3983b516d781",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Save search results to csv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bdbc4d4-695d-4751-ab7c-d2d98e2917d7",
   "metadata": {
    "tags": []
   },
   "source": [
    "Convert the dictionary of dictionaries into a dictionary of lists:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c6ddd83-bc87-48f2-a8d6-1bd3f4201ff7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "outdict = ammico.utils.append_data_to_dict(mydict)\n",
    "df = ammico.utils.dump_df(outdict)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea2675d5-604c-45e7-86d2-080b1f4559a0",
   "metadata": {
    "tags": []
   },
   "source": [
    "Check the dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e78646d6-80be-4d3e-8123-3360957bcaa8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05546d99-afab-4565-8f30-f14e1426abcf",
   "metadata": {},
   "source": [
    "Write the csv file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "185f7dde-20dc-44d8-9ab0-de41f9b5734d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "df.to_csv(\"./data_out.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ef1132f-eb2a-43d7-be1f-69e879490f33",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}