# Image summary and visual question answering

This notebook shows some preliminary work on image captioning and visual question answering with LAVIS. It is mainly meant to explore the library's capabilities and to decide on future research directions. We package our code into a `misinformation` package that is imported here:

In [1]:
from misinformation import utils as mutils
from misinformation import display as mdisplay
import misinformation.summary as sm

Set the path to the input images. Here, up to 10 files are read from `data/`.

In [2]:
images = mutils.find_files(
    path="data/",
    limit=10,
)

In [3]:
mydict = mutils.initialize_dict(images)
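
The internals of `mutils.find_files` and `mutils.initialize_dict` are not shown in this notebook. Judging from the outputs further down (file stems as keys, a `filename` column in the dataframe), the two steps can be sketched roughly as follows. The function names `find_image_files` and `init_results`, the `pattern` parameter, and the PNG-only assumption are hypothetical and only serve as illustration:

```python
from pathlib import Path


def find_image_files(path, pattern="*.png", limit=10):
    """Return up to `limit` files below `path` matching `pattern`, sorted by name.

    Hypothetical stand-in for illustration, not the package's implementation.
    """
    matches = sorted(str(p) for p in Path(path).rglob(pattern) if p.is_file())
    # A limit of None means "return everything that was found".
    return matches if limit is None else matches[:limit]


def init_results(files):
    """Map each file stem to a per-image dict that later analysis steps fill in."""
    return {Path(f).stem: {"filename": f} for f in files}
```

With this layout, every later analysis step can add its own keys to the per-image dictionary without touching the others.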

## Create captions for images and directly write to csv

Here you can choose between two models: "base" or "large".

In [4]:
obj = sm.SummaryDetector(mydict)
summary_model, summary_vis_processors = obj.load_model("base")
# summary_model, summary_vis_processors = obj.load_model("large")

100%|██████████| 2.50G/2.50G [00:19<00:00, 136MB/s] 
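
The `"base"` and `"large"` options presumably select between two BLIP captioning checkpoints in LAVIS. The sketch below assumes the LAVIS model name `blip_caption` with the model types `base_coco` and `large_coco`; the helper names are hypothetical, and the package's own `load_model` may be implemented differently:

```python
# Assumed mapping from the notebook's option to a LAVIS model_type.
CAPTION_MODEL_TYPES = {"base": "base_coco", "large": "large_coco"}


def caption_model_type(size):
    """Translate "base"/"large" into the assumed LAVIS model_type string."""
    try:
        return CAPTION_MODEL_TYPES[size]
    except KeyError:
        raise ValueError(
            f"size must be one of {sorted(CAPTION_MODEL_TYPES)}, got {size!r}"
        ) from None


def load_caption_model(size="base"):
    """Load a BLIP captioning model and its image preprocessors via LAVIS."""
    # Imported lazily so the helper above works without LAVIS or torch installed.
    import torch
    from lavis.models import load_model_and_preprocess

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip_caption",
        model_type=caption_model_type(size),
        is_eval=True,
        device=device,
    )
    return model, vis_processors
```

The first call downloads the checkpoint, which is what the progress bar above reports; later calls reuse the cached file.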




In [5]:
for key in mydict:
    mydict[key] = sm.SummaryDetector(mydict[key]).analyse_image(
        summary_model, summary_vis_processors
    )

Convert the dictionary of dictionaries into a dictionary with lists:

In [6]:
outdict = mutils.append_data_to_dict(mydict)
df = mutils.dump_df(outdict)
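
The conversion described above turns one record per image into one list per column, which is the layout a dataframe constructor expects. A minimal sketch, assuming every per-image dictionary has the same keys; `to_columns` is a hypothetical name and the sample records are placeholders, not real results:

```python
def to_columns(results):
    """Turn {image_id: {column: value}} into {column: [values, ...]}."""
    columns = {}
    for record in results.values():
        for column, value in record.items():
            columns.setdefault(column, []).append(value)
    return columns


# Placeholder records for illustration only.
results = {
    "img_a": {"filename": "data/img_a.png", "const_image_summary": "a first caption"},
    "img_b": {"filename": "data/img_b.png", "const_image_summary": "a second caption"},
}
columns = to_columns(results)
# `columns` can be passed to pandas.DataFrame(columns) to get one row per image.
```

If some records lack a key, the lists end up with different lengths, so a real implementation would need to fill in missing values first.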

Check the dataframe:

In [7]:
df.head(10)

| | filename | const_image_summary | 3_non-deterministic summary |
|---|---|---|---|
| 0 | data/106349S_por.png | a man wearing a face mask while looking at a c... | [a tv with a man in white shirt and face mask ... |
| 1 | data/102730_eng.png | two people in blue coats spray disinfection a van | [two men are wearing blue jackets and standing... |
| 2 | data/102141_2_eng.png | a collage of images including a corona sign, a... | [several pictures with different writing and a... |


Write the csv file:

In [8]:
df.to_csv("./data_out.csv")

## Manually inspect the summaries

To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this step.

`const_image_summary` - the constant summary, which does not change from run to run (produced by `analyse_image`).

`3_non-deterministic summary` - three different example summaries, which change from run to run (produced by `analyse_image`).

In [9]:
mdisplay.explore_analysis(mydict, identify="summary")

HBox(children=(Select(layout=Layout(width='20%'), options=('106349S_por', '102730_eng', '102141_2_eng'), rows=…
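
The difference between the constant and the non-deterministic summaries most likely comes from the decoding strategy. The sketch below assumes the LAVIS BLIP captioning interface, where `generate` uses beam search by default and accepts `use_nucleus_sampling` and `num_captions`; the function name `summarise` is hypothetical, and `analyse_image` itself may work differently:

```python
def summarise(model, image):
    """Return one deterministic caption and three sampled captions for `image`.

    `model` is assumed to expose the LAVIS BLIP captioning `generate` method.
    """
    samples = {"image": image}
    # Default decoding is beam search, which yields the same caption on every run.
    const_summary = model.generate(samples)[0]
    # Nucleus sampling draws tokens at random, so these captions vary between runs.
    sampled = model.generate(samples, use_nucleus_sampling=True, num_captions=3)
    return {
        "const_image_summary": const_summary,
        "3_non-deterministic summary": sampled,
    }
```

Here `image` would be the preprocessed image tensor produced by the visual processors returned alongside the model.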

## Generate answers to free-form, natural-language questions about images

Set the list of questions:

In [10]:
list_of_questions = [
    "How many persons on the picture?",
    "Are there any politicians in the picture?",
    "Does the picture show something from medicine?",
]

In [11]:
for key in mydict:
    mydict[key] = sm.SummaryDetector(mydict[key]).analyse_questions(list_of_questions)

100%|██████████| 1.35G/1.35G [00:09<00:00, 159MB/s]
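
The question-answering step can be sketched as a loop over the questions for a single image. This assumes a LAVIS BLIP VQA model (for example one loaded with `load_model_and_preprocess(name="blip_vqa", model_type="vqav2", ...)`) that exposes `predict_answers`; the function name `answer_questions` is hypothetical, and `analyse_questions` may batch the questions differently:

```python
def answer_questions(model, txt_processor, image, questions):
    """Ask each question about `image` and return {question: answer}.

    `model` is assumed to expose the LAVIS BLIP VQA `predict_answers` method.
    """
    answers = {}
    for question in questions:
        samples = {"image": image, "text_input": txt_processor(question)}
        # inference_method="generate" produces a free-form answer string.
        answers[question] = model.predict_answers(
            samples=samples, inference_method="generate"
        )[0]
    return answers
```

Using the question text as the dictionary key is what makes each question appear as its own column once the results are converted to a dataframe.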




In [12]:
mdisplay.explore_analysis(mydict, identify="summary")

HBox(children=(Select(layout=Layout(width='20%'), options=('106349S_por', '102730_eng', '102141_2_eng'), rows=…

Convert the dictionary of dictionaries into a dictionary with lists:

In [13]:
outdict2 = mutils.append_data_to_dict(mydict)
df2 = mutils.dump_df(outdict2)

In [14]:
df2.head(10)

| | filename | const_image_summary | 3_non-deterministic summary | How many persons on the picture? | Are there any politicians in the picture? | Does the picture show something from medicine? |
|---|---|---|---|---|---|---|
| 0 | data/106349S_por.png | a man wearing a face mask while looking at a c... | [a man wearing a face mask standing in front o... | 1 | yes | yes |
| 1 | data/102730_eng.png | two people in blue coats spray disinfection a van | [two men are wearing blue jackets and standing... | 2 | no | yes |
| 2 | data/102141_2_eng.png | a collage of images including a corona sign, a... | [several pictures with different writing and a... | 1 | no | yes |


In [15]:
df2.to_csv("./data_out2.csv")