Mirror of
https://github.com/ssciwr/AMMICO.git
synced 2025-10-29 13:06:04 +02:00
update notebook content
This commit is contained in:
parent
ff6de1c436
commit
9d382b7b6b
@@ -664,38 +664,15 @@
"\n",
"<img src=\"../../docs/source/_static/summary_detector.png\" width=\"800\" />\n",
"\n",
"This module is based on the [LAVIS](https://github.com/salesforce/LAVIS) library. Since the models can be quite large, an initial object is created which will load the necessary models into RAM/VRAM and then use them in the analysis. The user can specify the type of analysis to be performed using the `analysis_type` keyword. Setting it to `summary` will generate a caption (summary), `questions` will prepare answers (VQA) to a list of questions as set by the user, `summary_and_questions` will do both. Note that the desired analysis type needs to be set here in the initialization of the \n",
"detector object, and not when running the analysis for each image; the same holds true for the selected model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The implemented models are listed below.\n",
"### Multimodal Summary Model\n",
"\n",
"| input model name | model |\n",
"| ---------------- | ----- |\n",
"| base | BLIP image captioning base, ViT-B/16, pretrained on COCO dataset |\n",
"| large | BLIP image captioning large, ViT-L/16, pretrained on COCO dataset |\n",
"| vqa | BLIP base model fine-tuned on VQA v2.0 dataset |\n",
"| blip2_t5_pretrain_flant5xxl | BLIP2 pretrained on FlanT5<sub>XXL</sub> |\n",
"| blip2_t5_pretrain_flant5xl | BLIP2 pretrained on FlanT5<sub>XL</sub> |\n",
"| blip2_t5_caption_coco_flant5xl | BLIP2 pretrained on FlanT5<sub>XL</sub>, fine-tuned on COCO |\n",
"| blip2_opt_pretrain_opt2.7b | BLIP2 pretrained on OPT-2.7b |\n",
"| blip2_opt_pretrain_opt6.7b | BLIP2 pretrained on OPT-6.7b |\n",
"| blip2_opt_caption_coco_opt2.7b | BLIP2 pretrained on OPT-2.7b, fine-tuned on COCO |\n",
"| blip2_opt_caption_coco_opt6.7b | BLIP2 pretrained on OPT-6.7b, fine-tuned on COCO |\n",
"This module is built on the Qwen2.5-VL model family. In this project, two model variants are supported:\n",
"\n",
"Please note that the `base`, `large` and `vqa` models can be run on the default GPU available in Google Colab.\n",
"To run any of the more advanced `BLIP2` models you need more than 20 GB of video memory, so you need to connect to a paid A100 instance in Google Colab."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First of all, we can run only the summary part by setting `analysis_type` to `summary`. You can choose a `base` or a `large` `model_type`."
"1. `Qwen2.5-VL-3B-Instruct`, which requires approximately 3 GB of video memory to load.\n",
"2. `Qwen2.5-VL-7B-Instruct`, which requires up to 8 GB of VRAM for initialization.\n",
"\n",
"Each version can also be run on the CPU, but this significantly increases the processing time, so we do not recommend it; the option is nevertheless available.\n",
"The model type can be specified when initializing the `MultimodalSummaryModel` class:"
]
},
{
@@ -704,32 +681,120 @@
"metadata": {},
"outputs": [],
"source": [
"image_summary_detector = ammico.SummaryDetector(\n",
"    image_dict, analysis_type=\"summary\", model_type=\"base\"\n",
"model = ammico.MultimodalSummaryModel(\n",
"    model_id=\"Qwen/Qwen2.5-VL-7B-Instruct\"\n",
") # or \"Qwen/Qwen2.5-VL-3B-Instruct\" respectively"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also define the preferred device type (\"cpu\" or \"cuda\") explicitly during initialization:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = ammico.MultimodalSummaryModel(\n",
"    model_id=\"Qwen/Qwen2.5-VL-7B-Instruct\", device=\"cuda\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the initialization follows this logic:\n",
"\n",
"If a GPU is available, it is automatically detected and the model defaults to Qwen2.5-VL-7B-Instruct on \"cuda\".\n",
"\n",
"If no GPU is detected, the system falls back to the Qwen2.5-VL-3B-Instruct model on the \"cpu\" device."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for num, key in tqdm(enumerate(image_dict.keys()), total=len(image_dict)):\n",
"    image_dict[key] = image_summary_detector.analyse_image(\n",
"        subdict=image_dict[key], analysis_type=\"summary\"\n",
"    )\n",
"\n",
"    if num % dump_every == 0 | num == len(image_dict) - 1:\n",
"        image_df = ammico.get_dataframe(image_dict)\n",
"        image_df.to_csv(dump_file)"
"model = ammico.MultimodalSummaryModel()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For VQA, a list of questions needs to be passed when carrying out the analysis; these should be given as a list of strings."
"### Image Summary and VQA module\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To instantiate the class, you need to provide a `MultimodalSummaryModel` instance and the image dictionary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"image_summary_vqa = ammico.ImageSummaryDetector(summary_model=model, subdict=image_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To perform image analysis, use the `analyse_images_from_dict()` method.\n",
"This function provides flexible options for generating summaries and performing visual question answering. Its main parameters are:\n",
"1. `analysis_type` – defines the type of analysis to perform. Setting it to `summary` will generate a caption (summary), `questions` will prepare answers (VQA) to a list of questions as set by the user, `summary_and_questions` will do both.\n",
"2. `list_of_questions` – a list of text questions to be answered by the model. This parameter is required when `analysis_type` is set to `\"questions\"` or `\"summary_and_questions\"`.\n",
"3. `keys_batch_size` – controls the number of images processed per batch. Increasing this value may slightly improve performance, depending on your system.\n",
"The default is `16`, which provides a good balance between speed and stability on most setups.\n",
"4. `is_concise_summary` – determines the level of detail in generated captions:\n",
"    * `True` → produces short, concise summaries.\n",
"    * `False` → produces longer, more descriptive captions that may include additional context or atmosphere, but take more time to compute.\n",
"5. `is_concise_answer` – similar to the previous flag, but for controlling the level of detail in question answering responses."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Example Usage**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To generate a concise image summary only:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"summary = image_summary_vqa.analyse_images_from_dict(\n",
"    analysis_type=\"summary\", is_concise_summary=True\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To generate detailed summaries and answer multiple questions:\n",
"\n",
"First, define a list of questions:"
]
},
{
@@ -749,7 +814,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to execute only the VQA module without captioning, just specify the `analysis_type` as `questions` and `model_type` as `vqa`."
"Then call the function:"
]
},
{
@@ -758,232 +823,40 @@
"metadata": {},
"outputs": [],
"source": [
"image_summary_vqa_detector = ammico.SummaryDetector(\n",
"    image_dict, analysis_type=\"questions\", model_type=\"vqa\"\n",
")\n",
"\n",
"for num, key in tqdm(enumerate(image_dict.keys()), total=len(image_dict)):\n",
"    image_dict[key] = image_summary_vqa_detector.analyse_image(\n",
"        subdict=image_dict[key],\n",
"        analysis_type=\"questions\",\n",
"        list_of_questions=list_of_questions,\n",
"    )\n",
"    if num % dump_every == 0 | num == len(image_dict) - 1:\n",
"        image_df = ammico.get_dataframe(image_dict)\n",
"        image_df.to_csv(dump_file)"
"summary_and_answers = image_summary_vqa.analyse_images_from_dict(\n",
"    analysis_type=\"summary_and_questions\",\n",
"    list_of_questions=list_of_questions,\n",
"    is_concise_summary=False,\n",
"    is_concise_answer=False,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or you can specify the analysis type as `summary_and_questions`, then both caption creation and question answers will be generated for each image. In this case, you can choose a `base` or a `large` model_type. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"image_summary_vqa_detector = ammico.SummaryDetector(\n",
"    image_dict, analysis_type=\"summary_and_questions\", model_type=\"base\"\n",
")\n",
"for num, key in tqdm(enumerate(image_dict.keys()), total=len(image_dict)):\n",
"    image_dict[key] = image_summary_vqa_detector.analyse_image(\n",
"        subdict=image_dict[key],\n",
"        analysis_type=\"summary_and_questions\",\n",
"        list_of_questions=list_of_questions,\n",
"    )\n",
"    if num % dump_every == 0 | num == len(image_dict) - 1:\n",
"        image_df = ammico.get_dataframe(image_dict)\n",
"        image_df.to_csv(dump_file)"
"If you want to execute only the VQA module without captioning, just specify the `analysis_type` as `questions`."
]
},
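{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged illustration (keyword names taken from the parameter list above; adjust them if they differ in your installed version), a questions-only call on the `image_summary_vqa` detector could look like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical sketch: VQA only, no captioning\n",
"answers = image_summary_vqa.analyse_images_from_dict(\n",
"    analysis_type=\"questions\",\n",
"    list_of_questions=list_of_questions,\n",
"    is_concise_answer=True,\n",
")"
]
},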
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output is given as a dictionary with the following keys and data types:\n",
"The output of the `analyse_images_from_dict()` method is a dictionary, where each key corresponds to an input image identifier. Each entry in this dictionary contains the processed results for that image.\n",
"\n",
"| output key | output type | output value |\n",
"| ---------- | ----------- | ------------ |\n",
"| `const_image_summary` | `str` | when `analysis_type=\"summary\"` or `\"summary_and_questions\"`, constant image caption (does not change upon re-running the analysis for the same model) |\n",
"| `3_non-deterministic_summary` | `list[str]` | when `analysis_type=\"summary\"` or `summary_and_questions`, three different captions generated with different random seeds |\n",
"| *a user-defined input question* | `str` | when `analysis_type=\"questions\"` or `summary_and_questions`, the answer to the user-defined input question |\n"
"| `caption` | `str` | when `analysis_type=\"summary\"` or `\"summary_and_questions\"`, constant image caption |\n",
"| `vqa` | `list[str]` | when `analysis_type=\"questions\"` or `summary_and_questions`, the answers to the user-defined input questions |\n",
"\n"
]
},
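{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of reading these results, assuming the per-image keys `caption` and `vqa` described in the table above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical sketch: inspect the per-image results\n",
"for key, entry in summary_and_answers.items():\n",
"    print(key, entry.get(\"caption\"))\n",
"    for answer in entry.get(\"vqa\", []):\n",
"        print(\"  \", answer)"
]
},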
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### BLIP2 models\n",
"The BLIP2 models are computationally very heavy models, and require approximately 60GB of RAM. These models can easily use more than 20GB GPU memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"obj = ammico.SummaryDetector(\n",
"    subdict=image_dict,\n",
"    analysis_type=\"summary_and_questions\",\n",
"    model_type=\"blip2_t5_caption_coco_flant5xl\",\n",
")\n",
"# list of the new models that can be used:\n",
"# \"blip2_t5_pretrain_flant5xxl\",\n",
"# \"blip2_t5_pretrain_flant5xl\",\n",
"# \"blip2_t5_caption_coco_flant5xl\",\n",
"# \"blip2_opt_pretrain_opt2.7b\",\n",
"# \"blip2_opt_pretrain_opt6.7b\",\n",
"# \"blip2_opt_caption_coco_opt2.7b\",\n",
"# \"blip2_opt_caption_coco_opt6.7b\",\n",
"\n",
"# You can use `pretrain_` model types for zero-shot image-to-text generation with prompts.\n",
"# Or you can use `caption_coco_` model types to generate coco-style captions.\n",
"# `flant5` and `opt` mean that the model is equipped with FlanT5 and OPT LLMs respectively.\n",
"\n",
"# You can also perform all calculations on the CPU if you set device_type=\"cpu\", or on the GPU if you set device_type=\"cuda\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also pass a list of questions to this cell if `analysis_type=\"summary_and_questions\"` or `analysis_type=\"questions\"`. However, the format of the questions has changed for the new models.\n",
"\n",
"Here is an example of a list of questions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list_of_questions = [\n",
"    \"Question: Are there people in the image? Answer:\",\n",
"    \"Question: What is this picture about? Answer:\",\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for key in image_dict:\n",
"    image_dict[key] = obj.analyse_image(\n",
"        subdict=image_dict[key],\n",
"        analysis_type=\"questions\",\n",
"        list_of_questions=list_of_questions,\n",
"    )\n",
"\n",
"# analysis_type can be\n",
"# \"summary\",\n",
"# \"questions\",\n",
"# \"summary_and_questions\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also pass a question with previous answers as context into this model, as in the questions below, to get a more accurate answer:\n",
"\n",
"You can combine as many questions as you want in a single query as a list."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list_of_questions = [\n",
"    \"Question: What country is in the picture? Answer: USA. Question: Why? Answer: Because there is an American flag in the background. Question: Where it comes from? Answer:\",\n",
"    \"Question: Which city is this? Answer: Frankfurt. Question: Why?\",\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for key in image_dict:\n",
"    image_dict[key] = obj.analyse_image(\n",
"        subdict=image_dict[key],\n",
"        analysis_type=\"questions\",\n",
"        list_of_questions=list_of_questions,\n",
"    )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"image_dict"
]
},
{
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can also ask sequential questions if you pass the argument `consequential_questions=True`. This means that the answers to previous questions will be passed as context to the next question. However, this method will work a bit slower, because for each image the answers to the questions will not be calculated simultaneously, but sequentially. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"list_of_questions = [\n",
|
||||
" \"Question: Is this picture taken inside or outside? Answer:\",\n",
|
||||
" \"Question: Why? Answer:\",\n",
|
||||
"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for key in image_dict:\n",
|
||||
" image_dict[key] = obj.analyse_image(\n",
|
||||
" subdict=image_dict[key],\n",
|
||||
" analysis_type=\"questions\",\n",
|
||||
" list_of_questions=list_of_questions,\n",
|
||||
" consequential_questions=True,\n",
|
||||
" )"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"image_dict"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# write output to csv\n",
|
||||
"image_df = ammico.get_dataframe(image_dict)\n",
|
||||
"image_df.to_csv(\"/content/drive/MyDrive/misinformation-data/data_out.csv\")"
|
||||
"### Video summary and VQA module\n",
|
||||
"This module is currently under development and will be demonstrated here as soon as it is ready."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1069,369 +942,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This module shows how to carry out an image multimodal search with the [LAVIS](https://github.com/salesforce/LAVIS) library. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Indexing and extracting features from images in selected folder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First you need to select a model. You can choose one of the following models:\n",
"- [blip](https://github.com/salesforce/BLIP)\n",
"- [blip2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)\n",
"- [albef](https://github.com/salesforce/ALBEF)\n",
"- [clip_base](https://github.com/openai/CLIP/blob/main/model-card.md)\n",
"- [clip_vitl14](https://github.com/mlfoundations/open_clip)\n",
"- [clip_vitl14_336](https://github.com/mlfoundations/open_clip)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_type = \"blip\"\n",
"# model_type = \"blip2\"\n",
"# model_type = \"albef\"\n",
"# model_type = \"clip_base\"\n",
"# model_type = \"clip_vitl14\"\n",
"# model_type = \"clip_vitl14_336\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To process the loaded images using the selected model, use the below code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"image_dict = ammico.find_files(\n",
"    path=str(data_path),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"image_dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_obj = ammico.MultimodalSearch(image_dict)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"(\n",
"    model,\n",
"    vis_processors,\n",
"    txt_processors,\n",
"    image_keys,\n",
"    image_names,\n",
"    features_image_stacked,\n",
") = my_obj.parsing_images(\n",
"    model_type,\n",
"    path_to_save_tensors=str(data_path),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The images are then processed and stored in a numerical representation, a tensor. These tensors do not change for the same image and same model - so if you run this analysis once, and save the tensors giving a path with the keyword `path_to_save_tensors`, a file with filename `.<Number_of_images>_<model_name>_saved_features_image.pt` will be placed there.\n",
"\n",
"This can save you time if you want to analyse the same images with the same model but different questions. To run using the saved tensors, execute the below code, giving the path and name of the tensor file. Any subsequent query of the model will then run in a fraction of the time it took initially."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# uncomment the code below if you want to load the tensors from the drive\n",
"# and just want to ask different questions for the same set of images\n",
"# (\n",
"#     model,\n",
"#     vis_processors,\n",
"#     txt_processors,\n",
"#     image_keys,\n",
"#     image_names,\n",
"#     features_image_stacked,\n",
"# ) = my_obj.parsing_images(\n",
"#     model_type,\n",
"#     path_to_load_tensors=\"/content/drive/MyDrive/misinformation-data/5_clip_base_saved_features_image.pt\",\n",
"# )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we have already processed our image folder with 5 images and the `clip_base` model. So you just need to pass the name `5_clip_base_saved_features_image.pt` of the saved file containing the tensors of all images as the keyword argument `path_to_load_tensors`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Formulate your search queries\n",
"\n",
"Next, you need to form search queries. You can search either by image or by text. You can search for a single query, or you can search for several queries at once; the computational time should not be much different. The format of the queries is as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import importlib_resources  # only required for the image query example\n",
"\n",
"image_example_query = str(\n",
"    importlib_resources.files(\"ammico\") / \"data\" / \"test-crop-image.png\"\n",
")  # creating the path to the image for the image query example\n",
"\n",
"search_query = [\n",
"    {\n",
"        \"image\": image_example_query\n",
"    },  # this is what an image query looks like; here `image_example_query` is the path to the query image, e.g. \"data/test-crop-image.png\"\n",
"]"
]
},
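{
"cell_type": "markdown",
"metadata": {},
"source": [
"A text query is formed analogously as a dictionary with a single text entry; the key name `text_input` below is an assumption, so adjust it if your installed version expects a different key:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical sketch: text queries instead of (or in addition to) an image query\n",
"search_query = [\n",
"    {\"text_input\": \"politician press conference\"},\n",
"    {\"text_input\": \"a world map\"},\n",
"]"
]
},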
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can filter your results in 3 different ways:\n",
"- `filter_number_of_images` limits the number of images found. That is, if the parameter `filter_number_of_images = 10`, then the first 10 images that best match the query will be shown. The other images' ranks will be set to `None` and their similarity value to `0`.\n",
"- `filter_val_limit` sets a lower limit on the similarity value. That is, if the parameter `filter_val_limit = 0.2`, all images with a similarity of less than 0.2 will be discarded.\n",
"- `filter_rel_error` (in percent) limits the output to images whose similarity deviates from the best match by less than the given relative error, i.e. images satisfying `100 * abs(current_similarity_value - best_similarity_value_in_current_search) / best_similarity_value_in_current_search < filter_rel_error`. That is, if we set `filter_rel_error = 30` and the top-1 image has a similarity value of 0.5, we discard all images with a similarity of less than 0.35."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"similarity, sorted_lists = my_obj.multimodal_search(\n",
"    model,\n",
"    vis_processors,\n",
"    txt_processors,\n",
"    model_type,\n",
"    image_keys,\n",
"    features_image_stacked,\n",
"    search_query,\n",
"    filter_number_of_images=20,\n",
")"
]
},
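{
"cell_type": "markdown",
"metadata": {},
"source": [
"The call above uses only `filter_number_of_images`; as a hedged sketch (keyword names taken from the description above), all three filters can be combined in a single call:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical sketch: combine the three filter options\n",
"similarity, sorted_lists = my_obj.multimodal_search(\n",
"    model,\n",
"    vis_processors,\n",
"    txt_processors,\n",
"    model_type,\n",
"    image_keys,\n",
"    features_image_stacked,\n",
"    search_query,\n",
"    filter_number_of_images=10,\n",
"    filter_val_limit=0.2,\n",
"    filter_rel_error=30,\n",
")"
]
},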
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"similarity"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sorted_lists"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After running the `multimodal_search` function, the results of each query will be added to the source dictionary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"image_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A special function was written to present the search results conveniently. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_obj.show_results(\n",
"    search_query[0],  # you can change the index to see the results for other queries\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Formulate your search queries: Search for the best match using multiple reference images, for example, of a person"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Here goes the code that reads in multiple images as reference\n",
"# then you will loop over these multiple images and find the best matches\n",
"# in the end, the best matches will be averaged over for each picture and a list of averaged best matches will be provided"
]
},
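{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above is only a placeholder. Below is a hedged sketch of the intended logic, assuming one query dictionary per reference image and that `similarity` holds one row of scores per query (transpose the array if your version returns the opposite layout):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical sketch: average similarity scores over several reference images\n",
"import numpy as np\n",
"\n",
"reference_images = [image_example_query]  # extend with paths to further reference images\n",
"search_query = [{\"image\": path} for path in reference_images]\n",
"\n",
"similarity, sorted_lists = my_obj.multimodal_search(\n",
"    model,\n",
"    vis_processors,\n",
"    txt_processors,\n",
"    model_type,\n",
"    image_keys,\n",
"    features_image_stacked,\n",
"    search_query,\n",
")\n",
"\n",
"scores = np.asarray(similarity)  # move the tensor to the CPU first if it lives on the GPU\n",
"scores = scores.reshape(len(search_query), -1)  # assumed layout: one row of scores per query\n",
"averaged = scores.mean(axis=0)\n",
"best_matches = [image_keys[i] for i in np.argsort(averaged)[::-1][:10]]\n",
"best_matches"
]
},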
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Improve the search results: Use only for text queries, not image search\n",
"\n",
"For even better results, a slightly different approach has been prepared that can improve search results. It is quite resource-intensive, so it is applied after the main algorithm has found the most relevant images. This approach works only with text queries and skips image queries. Among the parameters, you can choose one of 3 models: `\"blip_base\"`, `\"blip_large\"`, `\"blip2_coco\"`. If you get an `Out of Memory` error, try reducing the `batch_size` value (minimum = 1), which is the number of images being processed simultaneously. With the parameter `need_grad_cam = True/False` you can enable the calculation of a heat map for each processed image and save them in `image_gradcam_with_itm`. The `image_text_match_reordering()` function then calculates new similarity values and new ranks for each image. The resulting values are added to the general dictionary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"itm_model = \"blip_base\"\n",
"# itm_model = \"blip_large\"\n",
"# itm_model = \"blip2_coco\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"itm_scores, image_gradcam_with_itm = my_obj.image_text_match_reordering(\n",
"    search_query,\n",
"    itm_model,\n",
"    image_keys,\n",
"    sorted_lists,\n",
"    batch_size=1,\n",
"    need_grad_cam=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, using the same output function, you can add the `itm=True` argument to output the new image order. Remember that for image queries, an error will be thrown with the `itm=True` argument. You can also pass `image_gradcam_with_itm` along with the `itm=True` argument to output the heat maps of the processed images."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_obj.show_results(\n",
"    search_query[0], itm=True, image_gradcam_with_itm=image_gradcam_with_itm\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save search results to csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert the dictionary of dictionaries into a dictionary with lists:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"outdict = ammico.append_data_to_dict(image_dict)\n",
"df = ammico.dump_df(outdict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the dataframe:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write the csv file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(\"/content/drive/MyDrive/misinformation-data/data_out.csv\")"
"This module is currently under development and will be demonstrated here as soon as it is ready."
]
},
{