Notebook for text extraction on image

The text extraction and analysis are carried out using a variety of tools:

  1. Text extraction from the image using google-cloud-vision

  2. Language detection of the extracted text using Googletrans

  3. Translation into English or other languages using Googletrans

  4. Cleaning of the text using spaCy

  5. Spell-check using TextBlob

  6. Subjectivity analysis using TextBlob

  7. Text summarization using transformers pipelines

  8. Sentiment analysis using transformers pipelines

  9. Named entity recognition using transformers pipelines

  10. Topic analysis using BERTopic

The first cell is only run on Google Colab and installs the ammico package.

After that, we can import ammico and read in the files given a folder path.

[1]:
# if running on google colab
# flake8-noqa-cell
import os

if "google.colab" in str(get_ipython()):
    # update python version
    # install setuptools
    # %pip install setuptools==61 -qqq
    # install ammico
    %pip install git+https://github.com/ssciwr/ammico.git -qqq
    # mount google drive for data and API key
    from google.colab import drive

    drive.mount("/content/drive")
[2]:
import os
import ammico
from ammico import utils as mutils
from ammico import display as mdisplay

We select a subset of image files to try the text extraction on; the limit keyword sets how many files are returned. The find_files function finds image files within a given directory:

[3]:
# Here you need to provide the path to your google drive folder
# or local folder containing the images
images = mutils.find_files(
    path="data/",
    limit=10,
)
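
As a rough illustration of what a directory search with a limit amounts to, here is a minimal standard-library sketch. The function name find_image_files and the list of file patterns are assumptions made for this example; this is not ammico's implementation of find_files.

```python
import glob
import os


def find_image_files(path, patterns=("*.png", "*.jpg", "*.jpeg"), limit=10):
    """Return up to `limit` image paths found below `path` (hypothetical helper)."""
    found = []
    for pattern in patterns:
        # "**" together with recursive=True also matches files directly in `path`.
        found.extend(glob.glob(os.path.join(path, "**", pattern), recursive=True))
    # Sort so that the selected subset is the same on every run.
    found = sorted(set(found))
    return found if limit is None else found[:limit]
```

Passing limit=None in this sketch returns every match, which can be useful once the workflow has been tested on a small subset.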

We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:

[4]:
mydict = mutils.initialize_dict(images)
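
To make the idea of a "dictionary of dictionaries" concrete, the following sketch builds one by hand. The helper build_image_dict and the choice of the file stem as key are assumptions for illustration; only the filename field is taken from the dataframe shown further down.

```python
import os


def build_image_dict(image_paths):
    """Map each image's file stem to a dict holding its path (hypothetical layout)."""
    result = {}
    for path in image_paths:
        key = os.path.splitext(os.path.basename(path))[0]
        result[key] = {"filename": path}
    return result


mydict_sketch = build_image_dict(["data/102141_2_eng.png", "data/102730_eng.png"])
# Each later analysis step adds further keys to the inner dictionary of an image.
mydict_sketch["102730_eng"]["text_language"] = "en"
```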

Google cloud vision API

You need an API key for this, and the app must be activated in your Google console. The first 1000 images per month are free (as of July 2022).

os.environ[
    "GOOGLE_APPLICATION_CREDENTIALS"
] = "your-credentials.json"
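
Before sending images to the API, it can help to check that the credentials variable actually points to a file. The helper below is a small sketch with a hypothetical name (credentials_configured); it only checks that the file exists and does not validate the key itself.

```python
import os


def credentials_configured(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return True if the variable is set and points to an existing file."""
    path = os.environ.get(env_var, "")
    return bool(path) and os.path.isfile(path)


if not credentials_configured():
    print("Credentials file not found; set GOOGLE_APPLICATION_CREDENTIALS first.")
```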

Inspect the elements per image

To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below. Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the port number if you are already running several notebook instances on the same server.

[5]:
analysis_explorer = mdisplay.AnalysisExplorer(mydict)
analysis_explorer.run_server(port=8054)
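
If several notebook instances share a server, the port can be picked automatically instead of incrementing it by hand. This is a sketch using only the standard library; the function name first_free_port, the local host address, and the number of attempts are assumptions for this example.

```python
import socket


def first_free_port(start=8054, attempts=20, host="127.0.0.1"):
    """Return the first port in [start, start + attempts) that can be bound."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                # The port is taken (or not permitted); try the next one.
                continue
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")
```

The result could then be passed on, for example as run_server(port=first_free_port()). Note that another process may still claim the port between the check and the server start.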

Or directly analyze for further processing

Instead of inspecting each of the images, you can also directly carry out the analysis and export the result into a csv. This may take a while depending on how many images you have loaded. Set the keyword analyse_text to True if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER).

[6]:
for key in mydict:
    mydict[key] = ammico.text.TextDetector(
        mydict[key], analyse_text=True
    ).analyse_image()

Convert to dataframe and write csv

These steps are required to convert the dictionary of dictionaries into a dictionary of lists that can be converted into a pandas dataframe and exported to a csv file.

[7]:
outdict = mutils.append_data_to_dict(mydict)
df = mutils.dump_df(outdict)
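
The conversion described above can be sketched in plain Python. The function name nested_to_columns and the example records are made up for illustration and are not the ammico implementation; missing fields are filled with None here, which is an assumption of this sketch.

```python
def nested_to_columns(nested):
    """Turn {image_key: {field: value}} into {field: [one value per image]}."""
    # Collect field names in first-seen order so the column order is stable.
    fields = []
    for record in nested.values():
        for field in record:
            if field not in fields:
                fields.append(field)
    # Missing fields are filled with None so that all lists have equal length.
    return {f: [rec.get(f) for rec in nested.values()] for f in fields}


example = {
    "img1": {"filename": "data/img1.png", "text_language": "pt"},
    "img2": {"filename": "data/img2.png", "text_language": "en"},
}
columns = nested_to_columns(example)
```

A dictionary of equal-length lists like columns can be passed directly to pandas.DataFrame to obtain one row per image.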

Check the dataframe:

[8]:
df.head(10)
[8]:
| | filename | text | text_language | text_english | text_summary | sentiment | sentiment_score | entity | entity_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | data/106349S_por.png | NEWS URGENTE SAMSUNG AO VIVO Rio de Janeiro NO... | pt | NEWS URGENT SAMSUNG LIVE Rio de Janeiro NEW CO... | NEW COUNTING METHOD RJ City HALL EXCLUDES 1,1... | NEGATIVE | 0.99 | [Rio de Janeiro, C, ##IT, P, ##NA, ##LTO] | [LOC, ORG, LOC, LOC, ORG, LOC] |
| 1 | data/102141_2_eng.png | CORONAVIRUS QUARANTINE CORONAVIRUS OUTBREAK BE... | en | CORONAVIRUS QUARANTINE CORONAVIRUS OUTBREAK BE... | Coronavirus QUARANTINE CORONAVIRUS OUTBREAK | NEGATIVE | 0.98 | [CORONAVIRUS, ##AR, ##TI, ##RONAVIR, ##C, Co] | [ORG, MISC, MISC, ORG, MISC, MISC] |
| 2 | data/102730_eng.png | 400 DEATHS GET E-BOOK X AN Corporation ncy Ser... | en | 400 DEATHS GET E-BOOK X AN Corporation ncy Ser... | A municipal worker sprays disinfectant on his... | NEGATIVE | 0.99 | [AN Corporation ncy Services, Ahmedabad, RE, #... | [ORG, LOC, PER, ORG] |

Write the csv file. Provide a file path and file name for the csv file to be written.

[9]:
# Write the csv
df.to_csv("./data_out.csv")

Topic analysis

The topic analysis is carried out with BERTopic, using an embedding model provided through a spaCy pipeline.

BERTopic takes a list of strings as input. The more items the list contains, the better the topic modeling works. If analyse_topic() below returns an error, the reason may be that your dataset is too small.
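
Since a small list of texts is a likely cause of errors, a simple check before the topic analysis can save time. The sketch below uses a hypothetical helper name (usable_documents), and the default minimum of 20 is an arbitrary placeholder, not a threshold documented by BERTopic.

```python
def usable_documents(texts, minimum=20):
    """Drop empty entries and report whether enough documents remain."""
    cleaned = [t.strip() for t in texts if isinstance(t, str) and t.strip()]
    return cleaned, len(cleaned) >= minimum


docs, enough = usable_documents(["NEWS URGENT SAMSUNG LIVE", "", "400 DEATHS GET E-BOOK"])
if not enough:
    print(f"Only {len(docs)} non-empty texts; topic modeling may fail on so few.")
```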

You can choose which dataframe entry to analyze. The default is text_english, but you could also select, for example, text_summary or text_english_correct by setting the keyword analyze_text like so:

ammico.text.PostprocessText(mydict=mydict, analyze_text="text_summary").analyse_topic()

Option 1: Use the dictionary as obtained from the above analysis.

[10]:
# make a list of all the text_english entries per analysed image from the mydict variable as above
topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(
    mydict=mydict
).analyse_topic()
Reading data from dict.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 2
      1 # make a list of all the text_english entries per analysed image from the mydict variable as above
----> 2 topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(
      3     mydict=mydict
      4 ).analyse_topic()

File ~/work/AMMICO/AMMICO/ammico/text.py:221, in PostprocessText.analyse_topic(self, return_topics)
    219 except TypeError:
    220     print("BERTopic excited with an error - maybe your dataset is too small?")
--> 221 self.topics, self.probs = self.topic_model.fit_transform(self.list_text_english)
    222 # return the topic list
    223 topic_df = self.topic_model.get_topic_info()

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/bertopic/_bertopic.py:356, in BERTopic.fit_transform(self, documents, embeddings, y)
    354 if self.seed_topic_list is not None and self.embedding_model is not None:
    355     y, embeddings = self._guided_topic_modeling(embeddings)
--> 356 umap_embeddings = self._reduce_dimensionality(embeddings, y)
    358 # Cluster reduced embeddings
    359 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/bertopic/_bertopic.py:2872, in BERTopic._reduce_dimensionality(self, embeddings, y, partial_fit)
   2869     except TypeError:
   2870         logger.info("The dimensionality reduction algorithm did not contain the `y` parameter and"
   2871                     " therefore the `y` parameter was not used")
-> 2872         self.umap_model.fit(embeddings)
   2874 umap_embeddings = self.umap_model.transform(embeddings)
   2875 logger.info("Reduced dimensionality")

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2684, in UMAP.fit(self, X, y)
   2681     print(ts(), "Construct embedding")
   2683 if self.transform_mode == "embedding":
-> 2684     self.embedding_, aux_data = self._fit_embed_data(
   2685         self._raw_data[index],
   2686         self.n_epochs,
   2687         init,
   2688         random_state,  # JH why raw data?
   2689     )
   2690     # Assign any points that are fully disconnected from our manifold(s) to have embedding
   2691     # coordinates of np.nan.  These will be filtered by our plotting functions automatically.
   2692     # They also prevent users from being deceived a distance query to one of these points.
   2693     # Might be worth moving this into simplicial_set_embedding or _fit_embed_data
   2694     disconnected_vertices = np.array(self.graph_.sum(axis=1)).flatten() == 0

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2717, in UMAP._fit_embed_data(self, X, n_epochs, init, random_state)
   2713 def _fit_embed_data(self, X, n_epochs, init, random_state):
   2714     """A method wrapper for simplicial_set_embedding that can be
   2715     replaced by subclasses.
   2716     """
-> 2717     return simplicial_set_embedding(
   2718         X,
   2719         self.graph_,
   2720         self.n_components,
   2721         self._initial_alpha,
   2722         self._a,
   2723         self._b,
   2724         self.repulsion_strength,
   2725         self.negative_sample_rate,
   2726         n_epochs,
   2727         init,
   2728         random_state,
   2729         self._input_distance_func,
   2730         self._metric_kwds,
   2731         self.densmap,
   2732         self._densmap_kwds,
   2733         self.output_dens,
   2734         self._output_distance_func,
   2735         self._output_metric_kwds,
   2736         self.output_metric in ("euclidean", "l2"),
   2737         self.random_state is None,
   2738         self.verbose,
   2739         tqdm_kwds=self.tqdm_kwds,
   2740     )

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:1078, in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)
   1073     embedding = random_state.uniform(
   1074         low=-10.0, high=10.0, size=(graph.shape[0], n_components)
   1075     ).astype(np.float32)
   1076 elif isinstance(init, str) and init == "spectral":
   1077     # We add a little noise to avoid local minima for optimization to come
-> 1078     initialisation = spectral_layout(
   1079         data,
   1080         graph,
   1081         n_components,
   1082         random_state,
   1083         metric=metric,
   1084         metric_kwds=metric_kwds,
   1085     )
   1086     expansion = 10.0 / np.abs(initialisation).max()
   1087     embedding = (initialisation * expansion).astype(
   1088         np.float32
   1089     ) + random_state.normal(
   (...)
   1092         np.float32
   1093     )

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/spectral.py:332, in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    330 try:
    331     if L.shape[0] < 2000000:
--> 332         eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
    333             L,
    334             k,
    335             which="SM",
    336             ncv=num_lanczos_vectors,
    337             tol=1e-4,
    338             v0=np.ones(L.shape[0]),
    339             maxiter=graph.shape[0] * 5,
    340         )
    341     else:
    342         eigenvalues, eigenvectors = scipy.sparse.linalg.lobpcg(
    343             L, random_state.normal(size=(L.shape[0], k)), largest=False, tol=1e-8
    344         )

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1605, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
   1600 warnings.warn("k >= N for N * N square matrix. "
   1601               "Attempting to use scipy.linalg.eigh instead.",
   1602               RuntimeWarning)
   1604 if issparse(A):
-> 1605     raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
   1606                     "k >= N. Use scipy.linalg.eigh(A.toarray()) or"
   1607                     " reduce k.")
   1608 if isinstance(A, LinearOperator):
   1609     raise TypeError("Cannot use scipy.linalg.eigh for LinearOperator "
   1610                     "A with k >= N.")

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

Option 2: Read in a csv

To avoid sending the same images to Google Cloud Vision again, read the text from the csv output when rerunning the analysis on already analysed images.
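
Reusing the exported text boils down to reading one column back from the csv file. The following is a minimal sketch with the standard library; the function name read_text_column is hypothetical, and the column name text_english is the default entry mentioned above.

```python
import csv


def read_text_column(csv_path, column="text_english"):
    """Return the non-empty values of `column` from a csv file with a header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Rows with an empty or missing entry in the column are skipped.
        return [row[column] for row in reader if row.get(column)]
```

The resulting list of strings has the shape that a topic model expects as input, without another call to the text extraction service.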

[11]:
input_file_path = "data_out.csv"
topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(
    use_csv=True, csv_path=input_file_path
).analyse_topic(return_topics=10)
Reading data from df.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 2
      1 input_file_path = "data_out.csv"
----> 2 topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(
      3     use_csv=True, csv_path=input_file_path
      4 ).analyse_topic(return_topics=10)

File ~/work/AMMICO/AMMICO/ammico/text.py:221, in PostprocessText.analyse_topic(self, return_topics)
    219 except TypeError:
    220     print("BERTopic excited with an error - maybe your dataset is too small?")
--> 221 self.topics, self.probs = self.topic_model.fit_transform(self.list_text_english)
    222 # return the topic list
    223 topic_df = self.topic_model.get_topic_info()

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/bertopic/_bertopic.py:356, in BERTopic.fit_transform(self, documents, embeddings, y)
    354 if self.seed_topic_list is not None and self.embedding_model is not None:
    355     y, embeddings = self._guided_topic_modeling(embeddings)
--> 356 umap_embeddings = self._reduce_dimensionality(embeddings, y)
    358 # Cluster reduced embeddings
    359 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/bertopic/_bertopic.py:2872, in BERTopic._reduce_dimensionality(self, embeddings, y, partial_fit)
   2869     except TypeError:
   2870         logger.info("The dimensionality reduction algorithm did not contain the `y` parameter and"
   2871                     " therefore the `y` parameter was not used")
-> 2872         self.umap_model.fit(embeddings)
   2874 umap_embeddings = self.umap_model.transform(embeddings)
   2875 logger.info("Reduced dimensionality")

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2684, in UMAP.fit(self, X, y)
   2681     print(ts(), "Construct embedding")
   2683 if self.transform_mode == "embedding":
-> 2684     self.embedding_, aux_data = self._fit_embed_data(
   2685         self._raw_data[index],
   2686         self.n_epochs,
   2687         init,
   2688         random_state,  # JH why raw data?
   2689     )
   2690     # Assign any points that are fully disconnected from our manifold(s) to have embedding
   2691     # coordinates of np.nan.  These will be filtered by our plotting functions automatically.
   2692     # They also prevent users from being deceived a distance query to one of these points.
   2693     # Might be worth moving this into simplicial_set_embedding or _fit_embed_data
   2694     disconnected_vertices = np.array(self.graph_.sum(axis=1)).flatten() == 0

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2717, in UMAP._fit_embed_data(self, X, n_epochs, init, random_state)
   2713 def _fit_embed_data(self, X, n_epochs, init, random_state):
   2714     """A method wrapper for simplicial_set_embedding that can be
   2715     replaced by subclasses.
   2716     """
-> 2717     return simplicial_set_embedding(
   2718         X,
   2719         self.graph_,
   2720         self.n_components,
   2721         self._initial_alpha,
   2722         self._a,
   2723         self._b,
   2724         self.repulsion_strength,
   2725         self.negative_sample_rate,
   2726         n_epochs,
   2727         init,
   2728         random_state,
   2729         self._input_distance_func,
   2730         self._metric_kwds,
   2731         self.densmap,
   2732         self._densmap_kwds,
   2733         self.output_dens,
   2734         self._output_distance_func,
   2735         self._output_metric_kwds,
   2736         self.output_metric in ("euclidean", "l2"),
   2737         self.random_state is None,
   2738         self.verbose,
   2739         tqdm_kwds=self.tqdm_kwds,
   2740     )

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:1078, in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)
   1073     embedding = random_state.uniform(
   1074         low=-10.0, high=10.0, size=(graph.shape[0], n_components)
   1075     ).astype(np.float32)
   1076 elif isinstance(init, str) and init == "spectral":
   1077     # We add a little noise to avoid local minima for optimization to come
-> 1078     initialisation = spectral_layout(
   1079         data,
   1080         graph,
   1081         n_components,
   1082         random_state,
   1083         metric=metric,
   1084         metric_kwds=metric_kwds,
   1085     )
   1086     expansion = 10.0 / np.abs(initialisation).max()
   1087     embedding = (initialisation * expansion).astype(
   1088         np.float32
   1089     ) + random_state.normal(
   (...)
   1092         np.float32
   1093     )

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/spectral.py:332, in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    330 try:
    331     if L.shape[0] < 2000000:
--> 332         eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
    333             L,
    334             k,
    335             which="SM",
    336             ncv=num_lanczos_vectors,
    337             tol=1e-4,
    338             v0=np.ones(L.shape[0]),
    339             maxiter=graph.shape[0] * 5,
    340         )
    341     else:
    342         eigenvalues, eigenvectors = scipy.sparse.linalg.lobpcg(
    343             L, random_state.normal(size=(L.shape[0], k)), largest=False, tol=1e-8
    344         )

File /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1605, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
   1600 warnings.warn("k >= N for N * N square matrix. "
   1601               "Attempting to use scipy.linalg.eigh instead.",
   1602               RuntimeWarning)
   1604 if issparse(A):
-> 1605     raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
   1606                     "k >= N. Use scipy.linalg.eigh(A.toarray()) or"
   1607                     " reduce k.")
   1608 if isinstance(A, LinearOperator):
   1609     raise TypeError("Cannot use scipy.linalg.eigh for LinearOperator "
   1610                     "A with k >= N.")

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
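
The error message above (`k >= N`) suggests that the dimensionality reduction step received fewer documents than it needs, which is likely when only a handful of images contain text. A minimal sketch of a guard that could be run before the topic analysis is shown below. The helper name `enough_documents` and the threshold `min_docs=20` are assumptions made for illustration; they are not part of ammico or BERTopic, and a suitable threshold depends on the model settings.

```python
def enough_documents(texts, min_docs=20):
    """Return True if at least `min_docs` entries are non-empty strings.

    `min_docs` is an illustrative threshold, not a value taken from
    ammico or BERTopic.
    """
    non_empty = [t for t in texts if isinstance(t, str) and t.strip()]
    return len(non_empty) >= min_docs


# Made-up example texts; empty strings and None are not counted.
sample_texts = ["first post", "", "second post", None, "third post"]

if enough_documents(sample_texts, min_docs=20):
    print("Enough text for topic analysis.")
else:
    print("Too few documents; skipping topic analysis.")
```

If the guard fails, the cells that follow cannot run either, because the variables they use are never created.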

Access frequent topics

Topic -1 collects the outliers and should be ignored. The topic count is the number of occurrences of that topic. The output is sorted from the most frequent to the least frequent topic.

[12]:
print(topic_df)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 print(topic_df)

NameError: name 'topic_df' is not defined
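
Since the cell above did not produce any output, here is a rough sketch of how the outlier topic could be dropped and the remaining topics ordered by frequency. It uses plain Python and made-up (topic, count) pairs instead of the real `topic_df`, so the data and the variable names `topic_counts` and `real_topics` are assumptions for illustration only.

```python
# Hypothetical (topic id, count) pairs, made up for illustration.
topic_counts = [(-1, 12), (0, 30), (2, 7), (1, 30)]

# Drop the outlier topic (-1), then sort by count, most frequent first.
real_topics = [(topic, count) for topic, count in topic_counts if topic != -1]
real_topics.sort(key=lambda pair: pair[1], reverse=True)

for topic, count in real_topics:
    print(f"Topic {topic}: {count} documents")
```

With a data frame, the same filter could be written as a boolean mask on the topic column.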

Get information for specific topic

The most frequent topics can be accessed through most_frequent_topics, which lists the most frequently occurring topics first.

[13]:
for topic in most_frequent_topics:
    print("Topic:", topic)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[13], line 1
----> 1 for topic in most_frequent_topics:
      2     print("Topic:", topic)

NameError: name 'most_frequent_topics' is not defined

Topic visualization

The topics can also be visualized. Careful: this only works if there is sufficient data, in both quantity and quality.

[14]:
topic_model.visualize_topics()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 topic_model.visualize_topics()

NameError: name 'topic_model' is not defined

Save the model

The model can be saved for future use.

[15]:
topic_model.save("misinfo_posts")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 topic_model.save("misinfo_posts")

NameError: name 'topic_model' is not defined