Notebook for text extraction on images

Text extraction and analysis are carried out using a variety of tools:

  1. Text extraction from the image using google-cloud-vision

  2. Language detection of the extracted text using Googletrans

  3. Translation into English or other languages using Googletrans

  4. Cleaning of the text using spaCy

  5. Spell-check using TextBlob

  6. Subjectivity analysis using TextBlob

  7. Text summarization using transformers pipelines

  8. Sentiment analysis using transformers pipelines

  9. Named entity recognition using transformers pipelines

  10. Topic analysis using BERTopic

The first cell is only run on Google Colab and installs the ammico package.

After that, we can import ammico and read in the files given a folder path.

[1]:
# if running on google colab
# flake8-noqa-cell
import os

if "google.colab" in str(get_ipython()):
    # update python version
    # install setuptools
    # %pip install setuptools==61 -qqq
    # install ammico
    %pip install git+https://github.com/ssciwr/ammico.git -qqq
    # mount google drive for data and API key
    from google.colab import drive

    drive.mount("/content/drive")
[2]:
import os
import ammico
from ammico import utils as mutils
from ammico import display as mdisplay

We select a subset of image files to try the text extraction on; see the limit keyword. The find_files function finds image files within a given directory:

[3]:
# Here you need to provide the path to your google drive folder
# or local folder containing the images
images = mutils.find_files(
    path="data/",
    limit=10,
)
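
The limit keyword caps how many files are returned. Conceptually, find_files can be pictured as the following pathlib sketch; the extension list and sorting here are assumptions for illustration, not ammico's actual implementation:

```python
from pathlib import Path


def find_image_files(path: str, limit: int = 10) -> list:
    """Collect up to `limit` image files from a directory (illustrative sketch)."""
    extensions = {".png", ".jpg", ".jpeg", ".gif"}
    # sort for reproducible ordering, then truncate to the requested limit
    files = sorted(
        str(p) for p in Path(path).iterdir()
        if p.suffix.lower() in extensions
    )
    return files[:limit]
```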

We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:

[4]:
mydict = mutils.initialize_dict(images)

Google Cloud Vision API

For this you need an API key and the app activated in your Google Cloud console. The first 1000 images per month are free (as of July 2022).

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "your-credentials.json"
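
A missing or misnamed credentials file only surfaces later, when the first API call fails. A small helper (not part of ammico, just a standard-library sketch) can fail early instead:

```python
import os


def set_google_credentials(credentials: str) -> None:
    """Set the credentials env var, failing early if the file is missing."""
    if not os.path.isfile(credentials):
        raise FileNotFoundError(f"Credentials file not found: {credentials}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials
```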

Inspect the elements per image

To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below. Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the port number if you are already running several notebook instances on the same server.
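
If the chosen port is already in use, you can ask the operating system for a free one instead of incrementing manually (plain standard-library Python, independent of ammico):

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]
```

The returned number can then be passed to run_server(port=...).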

[5]:
analysis_explorer = mdisplay.AnalysisExplorer(mydict)
analysis_explorer.run_server(port=8054)

Or directly analyze for further processing

Instead of inspecting each of the images, you can also directly carry out the analysis and export the result into a csv. This may take a while depending on how many images you have loaded. Set the keyword analyse_text to True if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER).

[6]:
for key in mydict:
    mydict[key] = ammico.text.TextDetector(
        mydict[key], analyse_text=True
    ).analyse_image()

Convert to dataframe and write csv

These steps are required to convert the dictionary of dictionaries into a dictionary of lists, which can then be converted into a pandas dataframe and exported to a csv file.
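
Conceptually, the conversion from a dictionary of dictionaries to a dictionary of lists works like this plain-Python sketch (ammico's append_data_to_dict may differ in detail, e.g. in how it records the filename):

```python
def dicts_to_columns(mydict: dict) -> dict:
    """Turn a dict of per-image dicts into a dict of column lists."""
    columns: dict = {}
    for image_key, entry in mydict.items():
        # each analysis field becomes one column; each image contributes one row
        for field, value in entry.items():
            columns.setdefault(field, []).append(value)
    return columns
```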

[7]:
outdict = mutils.append_data_to_dict(mydict)
df = mutils.dump_df(outdict)

Check the dataframe:

[8]:
df.head(10)

Write the csv file - here you should provide a file path and file name for the csv file to be written.

[9]:
# Write the csv
df.to_csv("./data_out.csv")

Topic analysis

The topic analysis is carried out with BERTopic, which uses an embedding model through a spaCy pipeline.

BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If analyse_topic() below returns an error, the reason may be that your dataset is too small.

You can choose which dataframe entry to analyze. The default is text_english, but you could, for example, also select text_summary or text_english_correct by setting the keyword analyze_text as follows:

ammico.text.PostprocessText(mydict=mydict, analyze_text="text_summary").analyse_topic()

Option 1: Use the dictionary as obtained from the above analysis.

[10]:
# make a list of all the text_english entries per analysed image from the mydict variable as above
topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(
    mydict=mydict
).analyse_topic()

Option 2: Read in a csv

To avoid analysing too many images with Google Cloud Vision, use the csv output to obtain the text when rerunning already analysed images.
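
When rereading the csv, the relevant text column can also be pulled out with the standard library alone. This sketch assumes the default text_english column written above:

```python
import csv


def read_text_column(csv_path: str, column: str = "text_english") -> list:
    """Read one text column from a previously exported csv file."""
    with open(csv_path, newline="", encoding="utf8") as f:
        reader = csv.DictReader(f)
        if column not in (reader.fieldnames or []):
            raise ValueError(f"No {column} column found in {csv_path}")
        return [row[column] for row in reader]
```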

[11]:
input_file_path = "data_out.csv"
topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(
    use_csv=True, csv_path=input_file_path
).analyse_topic(return_topics=10)

Access frequent topics

A topic ID of -1 stands for outliers and should be ignored. The topic count is the number of occurrences of that topic. The output is ordered from most frequent to least frequent topic.
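
As an illustration, the outlier topic can be filtered out and the remaining topics kept in descending order of count. The Topic/Count column names here mimic BERTopic's topic-info table and are an assumption about what topic_df contains:

```python
def filter_topics(topic_rows: list) -> list:
    """Drop the outlier topic (-1) and sort by descending count."""
    kept = [row for row in topic_rows if row["Topic"] != -1]
    return sorted(kept, key=lambda row: row["Count"], reverse=True)
```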

[12]:
print(topic_df)

Get information for a specific topic

The most frequent topics can be accessed through most_frequent_topics, with the most frequently occurring topics first in the list.

[13]:
for topic in most_frequent_topics:
    print("Topic:", topic)

Topic visualization

The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality).

[14]:
topic_model.visualize_topics()

Save the model

The model can be saved for future use.

[15]:
topic_model.save("misinfo_posts")