Notebook for text extraction on images
The text extraction and analysis are carried out using a variety of tools:
- Text extraction from the image using google-cloud-vision
- Language detection of the extracted text using Googletrans
- Translation into English or other languages using Googletrans
- Cleaning of the text using spaCy
- Spell-check using TextBlob
- Subjectivity analysis using TextBlob
- Text summarization using transformers pipelines
- Sentiment analysis using transformers pipelines
- Named entity recognition using transformers pipelines
- Topic analysis using BERTopic
The first cell is only run on Google Colab and installs the ammico package.
After that, we can import ammico and read in the files given a folder path.
[1]:
# if running on google colab
# flake8-noqa-cell
import os

if "google.colab" in str(get_ipython()):
    # update python version
    # install setuptools
    # %pip install setuptools==61 -qqq
    # install ammico
    %pip install git+https://github.com/ssciwr/ammico.git -qqq
    # mount google drive for data and API key
    from google.colab import drive

    drive.mount("/content/drive")
[2]:
import os
import ammico
from ammico import utils as mutils
from ammico import display as mdisplay
We select a subset of image files to try the text extraction on; see the limit keyword. The find_files function finds image files within a given directory:
[3]:
# Here you need to provide the path to your google drive folder
# or local folder containing the images
images = mutils.find_files(
    path="data/",
    limit=10,
)
We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:
[4]:
mydict = mutils.initialize_dict(images)
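The structure of this dictionary is easiest to see in a sketch. The helper below is hypothetical (the real layout is defined by ammico's initialize_dict), but it illustrates the idea: one sub-dictionary per image, keyed by the file stem, which the subsequent analysis steps extend with their results.

```python
from pathlib import Path

# hypothetical sketch of the nested dictionary; the real layout is
# defined by ammico's initialize_dict, not by this example
def initialize_dict_sketch(image_paths):
    # one sub-dictionary per image, keyed by the file stem; analysis
    # steps later add entries such as the extracted and translated text
    return {Path(p).stem: {"filename": p} for p in image_paths}

mydict_sketch = initialize_dict_sketch(["data/102141_2_eng.png"])
```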
Google cloud vision API
For this you need an API key, and the Cloud Vision API must be enabled in your Google Cloud console. The first 1000 images per month are free (as of July 2022).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "your-credentials.json"
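A missing or wrong credentials path only surfaces later, when the Vision client is first used. A small sanity check up front can save a long wait; this helper is not part of ammico, just a convenience sketch.

```python
import os

# hedged convenience check, not part of ammico: verify that the
# credentials variable is set and points to an existing file
def credentials_configured() -> bool:
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(path) and os.path.exists(path)
```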
Inspect the elements per image
To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this step and directly export a csv file below. Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the port number if you are already running several notebook instances on the same server.
[5]:
analysis_explorer = mdisplay.AnalysisExplorer(mydict)
analysis_explorer.run_server(port=8054)
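If you would rather not pick port numbers by hand, you can ask the operating system for a free one. This helper is not part of ammico, just a standard-library sketch.

```python
import socket

# stdlib sketch, not part of ammico: let the OS pick an unused port
def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 means "any free port"
        return s.getsockname()[1]
```

The returned number can then be passed to the explorer as run_server(port=find_free_port()).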
Or directly analyze for further processing
Instead of inspecting each of the images, you can also directly carry out the analysis and export the result into a csv. This may take a while depending on how many images you have loaded. Set the keyword analyse_text to True if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER).
[6]:
for key in mydict:
    mydict[key] = ammico.text.TextDetector(
        mydict[key], analyse_text=True
    ).analyse_image()
Convert to dataframe and write csv
These steps convert the dictionary of dictionaries into a dictionary of lists, which can be converted into a pandas dataframe and exported to a csv file.
[7]:
outdict = mutils.append_data_to_dict(mydict)
df = mutils.dump_df(outdict)
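The conversion can be sketched in plain Python. This is a hypothetical re-implementation for illustration only (the real logic lives in ammico's append_data_to_dict); the key point is that DataFrame.from_dict expects all lists to have the same length, so missing fields are padded here with None.

```python
# hypothetical re-implementation for illustration; the real logic is
# ammico's append_data_to_dict
def append_data_to_dict_sketch(mydict):
    # collect the union of all sub-dictionary keys, preserving order
    columns = []
    for subdict in mydict.values():
        for key in subdict:
            if key not in columns:
                columns.append(key)
    # one equal-length list per column, padded with None where a field
    # is missing, so DataFrame.from_dict accepts the result
    return {col: [sub.get(col) for sub in mydict.values()] for col in columns}

nested = {
    "img1": {"filename": "img1.png", "text_english": "hello"},
    "img2": {"filename": "img2.png"},
}
flat = append_data_to_dict_sketch(nested)
```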
Check the dataframe:
[8]:
df.head(10)
Write the csv file - here you should provide a file path and file name for the csv file to be written.
[9]:
# Write the csv
df.to_csv("./data_out.csv")
Topic analysis
The topic analysis is carried out using BERTopic, which uses an embedding model through a spaCy pipeline.
BERTopic takes a list of strings as input; the more items in the list, the better the topic modeling works. If analyse_topic() below returns an error, the reason may be that your dataset is too small.
You can choose which dataframe entry you would like to have analyzed. The default is text_english, but you could for example also select text_summary or text_english_correct by setting the keyword analyze_text like so:
ammico.text.PostprocessText(mydict=mydict, analyze_text="text_summary").analyse_topic()
Option 1: Use the dictionary as obtained from the above analysis.
[10]:
# make a list of all the text_english entries per analysed image from the mydict variable as above
topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(
    mydict=mydict
).analyse_topic()
Reading data from dict.
Option 2: Read in a csv
To avoid analysing too many images on Google Cloud Vision, you can use the csv output to obtain the text when rerunning already analysed images.
[11]:
input_file_path = "data_out.csv"
topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(
    use_csv=True, csv_path=input_file_path
).analyse_topic(return_topics=10)
Reading data from df.
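The csv route essentially pulls one text column out of the file. Its mechanics can be sketched with the standard library; the real implementation uses pandas, and the function name below is made up for illustration.

```python
import csv
import io

# stdlib sketch of the csv route; the real implementation uses pandas
# and this function name is hypothetical
def read_text_column(csv_text, column="text_english"):
    reader = csv.DictReader(io.StringIO(csv_text))
    if column not in (reader.fieldnames or []):
        raise ValueError(f"no {column} text data found")
    return [row[column] for row in reader]

sample = "filename,text_english\nimg1.png,hello world\nimg2.png,more text\n"
```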
Access frequent topics
A topic of -1 stands for an outlier and should be ignored. The topic count is the number of occurrences of that topic. The output is ordered from most frequent to least frequent topic.
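In terms of plain data, the filtering and ordering described above amount to something like the following sketch (hypothetical (topic_id, count) pairs, not real model output):

```python
# hypothetical (topic_id, count) pairs, not real model output
topics = [(1, 5), (-1, 12), (0, 8), (2, 3)]
# drop the outlier topic -1 and order by descending count
frequent = sorted(((t, c) for t, c in topics if t != -1), key=lambda x: -x[1])
```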
[12]:
print(topic_df)
Get information for specific topic
The most frequent topics can be accessed through most_frequent_topics, with the most frequent topic first in the list.
[13]:
for topic in most_frequent_topics:
    print("Topic:", topic)
Topic visualization
The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality).
[14]:
topic_model.visualize_topics()
Save the model
The model can be saved for future use.
[15]:
topic_model.save("misinfo_posts")