{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "dcaa3da1",
"metadata": {},
"source": [
"# Notebook for text extraction on image\n",
"\n",
"The text extraction and analysis is carried out using a variety of tools: \n",
"\n",
"1. Text extraction from the image using [google-cloud-vision](https://cloud.google.com/vision) \n",
"1. Language detection of the extracted text using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
"1. Translation into English or other languages using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
"1. Cleaning of the text using [spacy](https://spacy.io/) \n",
"1. Spell-check using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
"1. Subjectivity analysis using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
"1. Text summarization using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n",
"1. Sentiment analysis using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
"1. Named entity recognition using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
"1. Topic analysis using [BERTopic](https://github.com/MaartenGr/BERTopic) \n",
"\n",
"The first cell is only run on google colab and installs the [ammico](https://github.com/ssciwr/AMMICO) package.\n",
"\n",
"After that, we can import `ammico` and read in the files given a folder path."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f43f327c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:20:46.163647Z",
"iopub.status.busy": "2023-10-20T12:20:46.163348Z",
"iopub.status.idle": "2023-10-20T12:20:46.172494Z",
"shell.execute_reply": "2023-10-20T12:20:46.171732Z"
}
},
"outputs": [],
"source": [
"# if running on google colab\n",
"# flake8-noqa-cell\n",
"import os\n",
"\n",
"if \"google.colab\" in str(get_ipython()):\n",
" # update python version\n",
" # install setuptools\n",
" # %pip install setuptools==61 -qqq\n",
" # install ammico\n",
" %pip install git+https://github.com/ssciwr/ammico.git -qqq\n",
" # mount google drive for data and API key\n",
" from google.colab import drive\n",
"\n",
" drive.mount(\"/content/drive\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cf362e60",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:20:46.177023Z",
"iopub.status.busy": "2023-10-20T12:20:46.176316Z",
"iopub.status.idle": "2023-10-20T12:21:03.519993Z",
"shell.execute_reply": "2023-10-20T12:21:03.519212Z"
}
},
"outputs": [],
"source": [
"import os\n",
"import ammico\n",
"from ammico import utils as mutils\n",
"from ammico import display as mdisplay"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "fddba721",
"metadata": {},
"source": [
"We select a subset of image files to try the text extraction on, see the `limit` keyword. The `find_files` function finds image files within a given directory: "
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "27675810",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:03.526923Z",
"iopub.status.busy": "2023-10-20T12:21:03.525805Z",
"iopub.status.idle": "2023-10-20T12:21:03.531640Z",
"shell.execute_reply": "2023-10-20T12:21:03.531019Z"
}
},
"outputs": [],
"source": [
"# Here you need to provide the path to your google drive folder\n",
"# or local folder containing the images\n",
"images = mutils.find_files(\n",
" path=\"data/\",\n",
" limit=10,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3a7dfe11",
"metadata": {},
"source": [
"We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8b32409f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:03.535641Z",
"iopub.status.busy": "2023-10-20T12:21:03.535192Z",
"iopub.status.idle": "2023-10-20T12:21:03.539393Z",
"shell.execute_reply": "2023-10-20T12:21:03.538742Z"
}
},
"outputs": [],
"source": [
"mydict = mutils.initialize_dict(images)"
]
},
{
"cell_type": "markdown",
"id": "7b8b929f",
"metadata": {},
"source": [
"## Google cloud vision API\n",
"\n",
"For this you need an API key and have the app activated in your google console. The first 1000 images per month are free (July 2022)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986",
"metadata": {},
"source": [
"```\n",
"os.environ[\n",
" \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
"] = \"your-credentials.json\"\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0891b795-c7fe-454c-a45d-45fadf788142",
"metadata": {},
"source": [
"## Inspect the elements per image\n",
"To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below.\n",
"Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the `port` number if you are already running several notebook instances on the same server."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7c6ecc88",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:03.543363Z",
"iopub.status.busy": "2023-10-20T12:21:03.542918Z",
"iopub.status.idle": "2023-10-20T12:21:04.819592Z",
"shell.execute_reply": "2023-10-20T12:21:04.818855Z"
}
},
"outputs": [
{
"ename": "TypeError",
"evalue": "__init__() got an unexpected keyword argument 'identify'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[5], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m analysis_explorer \u001b[38;5;241m=\u001b[39m \u001b[43mmdisplay\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mAnalysisExplorer\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43midentify\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mtext-on-image\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2\u001b[0m analysis_explorer\u001b[38;5;241m.\u001b[39mrun_server(port\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m8054\u001b[39m)\n",
"\u001b[0;31mTypeError\u001b[0m: __init__() got an unexpected keyword argument 'identify'"
]
}
],
"source": [
"analysis_explorer = mdisplay.AnalysisExplorer(mydict, identify=\"text-on-image\")\n",
"analysis_explorer.run_server(port=8054)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52",
"metadata": {},
"source": [
"## Or directly analyze for further processing\n",
"Instead of inspecting each of the images, you can also directly carry out the analysis and export the result into a csv. This may take a while depending on how many images you have loaded. Set the keyword `analyse_text` to `True` if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:04.824811Z",
"iopub.status.busy": "2023-10-20T12:21:04.824289Z",
"iopub.status.idle": "2023-10-20T12:21:12.540436Z",
"shell.execute_reply": "2023-10-20T12:21:12.539186Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting en-core-web-md==3.7.0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.0/en_core_web_md-3.7.0-py3-none-any.whl (42.8 MB)\n",
"\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/42.8 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.1/42.8 MB\u001b[0m \u001b[31m4.2 MB/s\u001b[0m eta \u001b[36m0:00:11\u001b[0m\r",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.5/42.8 MB\u001b[0m \u001b[31m7.7 MB/s\u001b[0m eta \u001b[36m0:00:06\u001b[0m\r",
"\u001b[2K \u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.0/42.8 MB\u001b[0m \u001b[31m9.4 MB/s\u001b[0m eta \u001b[36m0:00:05\u001b[0m\r",
"\u001b[2K \u001b[91m━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.5/42.8 MB\u001b[0m \u001b[31m11.0 MB/s\u001b[0m eta \u001b[36m0:00:04\u001b[0m\r",
"\u001b[2K \u001b[91m━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.2/42.8 MB\u001b[0m \u001b[31m12.4 MB/s\u001b[0m eta \u001b[36m0:00:04\u001b[0m"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r",
"\u001b[2K \u001b[91m━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.9/42.8 MB\u001b[0m \u001b[31m13.7 MB/s\u001b[0m eta \u001b[36m0:00:03\u001b[0m\r",
"\u001b[2K \u001b[91m━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.9/42.8 MB\u001b[0m \u001b[31m15.6 MB/s\u001b[0m eta \u001b[36m0:00:03\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.0/42.8 MB\u001b[0m \u001b[31m17.8 MB/s\u001b[0m eta \u001b[36m0:00:03\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.4/42.8 MB\u001b[0m \u001b[31m20.1 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m8.0/42.8 MB\u001b[0m \u001b[31m22.7 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m10.1/42.8 MB\u001b[0m \u001b[31m25.9 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.1/42.8 MB\u001b[0m \u001b[31m39.5 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m15.0/42.8 MB\u001b[0m \u001b[31m54.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m18.7/42.8 MB\u001b[0m \u001b[31m73.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m23.1/42.8 MB\u001b[0m \u001b[31m105.2 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━\u001b[0m \u001b[32m25.7/42.8 MB\u001b[0m \u001b[31m99.0 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━\u001b[0m \u001b[32m30.2/42.8 MB\u001b[0m \u001b[31m107.3 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━\u001b[0m \u001b[32m35.0/42.8 MB\u001b[0m \u001b[31m107.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━\u001b[0m \u001b[32m40.3/42.8 MB\u001b[0m \u001b[31m138.5 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m146.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m146.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m146.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m57.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25h"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: spacy<3.8.0,>=3.7.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from en-core-web-md==3.7.0) (3.7.2)\n",
"Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (3.0.12)\n",
"Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (1.0.5)\n",
"Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (1.0.10)\n",
"Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.0.8)\n",
"Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (3.0.9)\n",
"Requirement already satisfied: thinc<8.3.0,>=8.1.8 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (8.2.1)\n",
"Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (1.1.2)\n",
"Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.4.8)\n",
"Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.0.10)\n",
"Requirement already satisfied: weasel<0.4.0,>=0.1.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (0.3.3)\n",
"Requirement already satisfied: typer<0.10.0,>=0.3.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (0.9.0)\n",
"Requirement already satisfied: smart-open<7.0.0,>=5.2.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (6.4.0)\n",
"Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (4.66.1)\n",
"Requirement already satisfied: requests<3.0.0,>=2.13.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.31.0)\n",
"Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (1.10.13)\n",
"Requirement already satisfied: jinja2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (3.1.2)\n",
"Requirement already satisfied: setuptools in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (58.1.0)\n",
"Requirement already satisfied: packaging>=20.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (23.2)\n",
"Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (3.3.0)\n",
"Requirement already satisfied: numpy>=1.19.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (1.23.4)\n",
"Requirement already satisfied: typing-extensions>=4.2.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (4.5.0)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (3.3.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.10)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.0.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2023.7.22)\n",
"Requirement already satisfied: blis<0.8.0,>=0.7.8 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from thinc<8.3.0,>=8.1.8->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (0.7.11)\n",
"Requirement already satisfied: confection<1.0.0,>=0.0.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from thinc<8.3.0,>=8.1.8->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (0.1.3)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: click<9.0.0,>=7.1.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from typer<0.10.0,>=0.3.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (8.1.7)\n",
"Requirement already satisfied: cloudpathlib<0.17.0,>=0.7.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from weasel<0.4.0,>=0.1.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (0.16.0)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from jinja2->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.1.3)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Installing collected packages: en-core-web-md\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Successfully installed en-core-web-md-3.7.0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
"You can now load the package via spacy.load('en_core_web_md')\n"
]
},
{
"ename": "FileNotFoundError",
"evalue": "[Errno 2] No such file or directory: '102141_2_eng'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[6], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m key \u001b[38;5;129;01min\u001b[39;00m mydict:\n\u001b[0;32m----> 2\u001b[0m mydict[key] \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mTextDetector\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m[\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43manalyse_text\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\n\u001b[1;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43manalyse_image\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:158\u001b[0m, in \u001b[0;36mTextDetector.analyse_image\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 152\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21manalyse_image\u001b[39m(\u001b[38;5;28mself\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mdict\u001b[39m:\n\u001b[1;32m 153\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Perform text extraction and analysis of the text.\u001b[39;00m\n\u001b[1;32m 154\u001b[0m \n\u001b[1;32m 155\u001b[0m \u001b[38;5;124;03m Returns:\u001b[39;00m\n\u001b[1;32m 156\u001b[0m \u001b[38;5;124;03m dict: The updated dictionary with text analysis results.\u001b[39;00m\n\u001b[1;32m 157\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 158\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_from_image\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 159\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtranslate_text()\n\u001b[1;32m 160\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mremove_linebreaks()\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:178\u001b[0m, in \u001b[0;36mTextDetector.get_text_from_image\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 174\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m DefaultCredentialsError:\n\u001b[1;32m 175\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m DefaultCredentialsError(\n\u001b[1;32m 176\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease provide credentials for google cloud vision API, see https://cloud.google.com/docs/authentication/application-default-credentials.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 177\u001b[0m )\n\u001b[0;32m--> 178\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[43mio\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mopen\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mrb\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m \u001b[38;5;28;01mas\u001b[39;00m image_file:\n\u001b[1;32m 179\u001b[0m content \u001b[38;5;241m=\u001b[39m image_file\u001b[38;5;241m.\u001b[39mread()\n\u001b[1;32m 180\u001b[0m image \u001b[38;5;241m=\u001b[39m vision\u001b[38;5;241m.\u001b[39mImage(content\u001b[38;5;241m=\u001b[39mcontent)\n",
"\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '102141_2_eng'"
]
}
],
"source": [
"for key in mydict:\n",
" mydict[key] = ammico.text.TextDetector(\n",
" mydict[key], analyse_text=True\n",
" ).analyse_image()"
]
},
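{
"attachments": {},
"cell_type": "markdown",
"id": "7f3a9c21",
"metadata": {},
"source": [
"To peek at what the detector stored, you can print one of the entries. A minimal sketch, assuming the analysis above ran through and filled the `text_english` field:\n",
"\n",
"```\n",
"# inspect the result for the first image\n",
"example_key = list(mydict.keys())[0]\n",
"print(mydict[example_key][\"text_english\"])\n",
"```"
]
},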
{
"attachments": {},
"cell_type": "markdown",
"id": "3c063eda",
"metadata": {},
"source": [
"## Convert to dataframe and write csv\n",
"These steps are required to convert the dictionary of dictionarys into a dictionary with lists, that can be converted into a pandas dataframe and exported to a csv file."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5709c2cd",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:12.545579Z",
"iopub.status.busy": "2023-10-20T12:21:12.544933Z",
"iopub.status.idle": "2023-10-20T12:21:13.263838Z",
"shell.execute_reply": "2023-10-20T12:21:13.261908Z"
}
},
"outputs": [
{
"ename": "ValueError",
"evalue": "All arrays must be of the same length",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[7], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m outdict \u001b[38;5;241m=\u001b[39m mutils\u001b[38;5;241m.\u001b[39mappend_data_to_dict(mydict)\n\u001b[0;32m----> 2\u001b[0m df \u001b[38;5;241m=\u001b[39m \u001b[43mmutils\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdump_df\u001b[49m\u001b[43m(\u001b[49m\u001b[43moutdict\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/utils.py:222\u001b[0m, in \u001b[0;36mdump_df\u001b[0;34m(mydict)\u001b[0m\n\u001b[1;32m 220\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mdump_df\u001b[39m(mydict: \u001b[38;5;28mdict\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m DataFrame:\n\u001b[1;32m 221\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Utility to dump the dictionary into a dataframe.\"\"\"\u001b[39;00m\n\u001b[0;32m--> 222\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mDataFrame\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/frame.py:1816\u001b[0m, in \u001b[0;36mDataFrame.from_dict\u001b[0;34m(cls, data, orient, dtype, columns)\u001b[0m\n\u001b[1;32m 1810\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 1811\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mExpected \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m, \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m or \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mtight\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m for orient parameter. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1812\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mGot \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00morient\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m instead\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1813\u001b[0m )\n\u001b[1;32m 1815\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m orient \u001b[38;5;241m!=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtight\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[0;32m-> 1816\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mcls\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1817\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 1818\u001b[0m realdata \u001b[38;5;241m=\u001b[39m data[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdata\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/frame.py:736\u001b[0m, in \u001b[0;36mDataFrame.__init__\u001b[0;34m(self, data, index, columns, dtype, copy)\u001b[0m\n\u001b[1;32m 730\u001b[0m mgr \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_init_mgr(\n\u001b[1;32m 731\u001b[0m data, axes\u001b[38;5;241m=\u001b[39m{\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m\"\u001b[39m: index, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m\"\u001b[39m: columns}, dtype\u001b[38;5;241m=\u001b[39mdtype, copy\u001b[38;5;241m=\u001b[39mcopy\n\u001b[1;32m 732\u001b[0m )\n\u001b[1;32m 734\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, \u001b[38;5;28mdict\u001b[39m):\n\u001b[1;32m 735\u001b[0m \u001b[38;5;66;03m# GH#38939 de facto copy defaults to False only in non-dict cases\u001b[39;00m\n\u001b[0;32m--> 736\u001b[0m mgr \u001b[38;5;241m=\u001b[39m \u001b[43mdict_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmanager\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 737\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, ma\u001b[38;5;241m.\u001b[39mMaskedArray):\n\u001b[1;32m 738\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mnumpy\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mma\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m mrecords\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:503\u001b[0m, in \u001b[0;36mdict_to_mgr\u001b[0;34m(data, index, columns, dtype, typ, copy)\u001b[0m\n\u001b[1;32m 499\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 500\u001b[0m \u001b[38;5;66;03m# dtype check to exclude e.g. range objects, scalars\u001b[39;00m\n\u001b[1;32m 501\u001b[0m arrays \u001b[38;5;241m=\u001b[39m [x\u001b[38;5;241m.\u001b[39mcopy() \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(x, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdtype\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m x \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m arrays]\n\u001b[0;32m--> 503\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43marrays_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtyp\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mconsolidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:114\u001b[0m, in \u001b[0;36marrays_to_mgr\u001b[0;34m(arrays, columns, index, dtype, verify_integrity, typ, consolidate)\u001b[0m\n\u001b[1;32m 111\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m verify_integrity:\n\u001b[1;32m 112\u001b[0m \u001b[38;5;66;03m# figure out the index, if necessary\u001b[39;00m\n\u001b[1;32m 113\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m index \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m--> 114\u001b[0m index \u001b[38;5;241m=\u001b[39m \u001b[43m_extract_index\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 115\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 116\u001b[0m index \u001b[38;5;241m=\u001b[39m ensure_index(index)\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:677\u001b[0m, in \u001b[0;36m_extract_index\u001b[0;34m(data)\u001b[0m\n\u001b[1;32m 675\u001b[0m lengths \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mset\u001b[39m(raw_lengths))\n\u001b[1;32m 676\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(lengths) \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[0;32m--> 677\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAll arrays must be of the same length\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 679\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m have_dicts:\n\u001b[1;32m 680\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 681\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMixing dicts with non-Series may lead to ambiguous ordering.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 682\u001b[0m )\n",
"\u001b[0;31mValueError\u001b[0m: All arrays must be of the same length"
]
}
],
"source": [
"outdict = mutils.append_data_to_dict(mydict)\n",
"df = mutils.dump_df(outdict)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ae182eb7",
"metadata": {},
"source": [
"Check the dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c4f05637",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:13.268530Z",
"iopub.status.busy": "2023-10-20T12:21:13.267995Z",
"iopub.status.idle": "2023-10-20T12:21:13.311720Z",
"shell.execute_reply": "2023-10-20T12:21:13.310827Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'df' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[8], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mdf\u001b[49m\u001b[38;5;241m.\u001b[39mhead(\u001b[38;5;241m10\u001b[39m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'df' is not defined"
]
}
],
"source": [
"df.head(10)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "eedf1e47",
"metadata": {},
"source": [
"Write the csv file - here you should provide a file path and file name for the csv file to be written."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "bf6c9ddb",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:13.316399Z",
"iopub.status.busy": "2023-10-20T12:21:13.315650Z",
"iopub.status.idle": "2023-10-20T12:21:13.361379Z",
"shell.execute_reply": "2023-10-20T12:21:13.360605Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'df' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[9], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Write the csv\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mdf\u001b[49m\u001b[38;5;241m.\u001b[39mto_csv(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m./data_out.csv\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'df' is not defined"
]
}
],
"source": [
"# Write the csv\n",
"df.to_csv(\"./data_out.csv\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4bc8ac0a",
"metadata": {},
"source": [
"## Topic analysis\n",
"The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) using an embedded model through a [spaCy](https://spacy.io/) pipeline."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4931941b",
"metadata": {},
"source": [
"BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If the below returns an error for `analyse_topic()`, the reason can be that your dataset is too small.\n",
"\n",
"You can pass which dataframe entry you would like to have analyzed. The default is `text_english`, but you could for example also select `text_summary` or `text_english_correct` setting the keyword `analyze_text` as so:\n",
"\n",
"`ammico.text.PostprocessText(mydict=mydict, analyze_text=\"text_summary\").analyse_topic()`\n",
"\n",
"### Option 1: Use the dictionary as obtained from the above analysis."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a3450a61",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:13.366656Z",
"iopub.status.busy": "2023-10-20T12:21:13.366030Z",
"iopub.status.idle": "2023-10-20T12:21:13.434827Z",
"shell.execute_reply": "2023-10-20T12:21:13.434058Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reading data from dict.\n"
]
},
{
"ename": "ValueError",
"evalue": "Please check your provided dictionary - no text_english text data found.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[10], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# make a list of all the text_english entries per analysed image from the mydict variable as above\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m topic_model, topic_df, most_frequent_topics \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPostprocessText\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mmydict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmydict\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39manalyse_topic()\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:303\u001b[0m, in \u001b[0;36mPostprocessText.__init__\u001b[0;34m(self, mydict, use_csv, csv_path, analyze_text)\u001b[0m\n\u001b[1;32m 301\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from dict.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 302\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict \u001b[38;5;241m=\u001b[39m mydict\n\u001b[0;32m--> 303\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlist_text_english \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43manalyze_text\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 304\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39muse_csv:\n\u001b[1;32m 305\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from df.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:375\u001b[0m, in \u001b[0;36mPostprocessText.get_text_dict\u001b[0;34m(self, analyze_text)\u001b[0m\n\u001b[1;32m 373\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m key \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict\u001b[38;5;241m.\u001b[39mkeys():\n\u001b[1;32m 374\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m analyze_text \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict[key]:\n\u001b[0;32m--> 375\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 376\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease check your provided dictionary - \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 377\u001b[0m \u001b[38;5;124m no \u001b[39m\u001b[38;5;132;01m{}\u001b[39;00m\u001b[38;5;124m text data found.\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mformat(\n\u001b[1;32m 378\u001b[0m analyze_text\n\u001b[1;32m 379\u001b[0m )\n\u001b[1;32m 380\u001b[0m )\n\u001b[1;32m 381\u001b[0m list_text_english\u001b[38;5;241m.\u001b[39mappend(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict[key][analyze_text])\n\u001b[1;32m 382\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m list_text_english\n",
"\u001b[0;31mValueError\u001b[0m: Please check your provided dictionary - no text_english text data found."
]
}
],
"source": [
"# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
" mydict=mydict\n",
").analyse_topic()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "95667342",
"metadata": {},
"source": [
"### Option 2: Read in a csv\n",
"Not to analyse too many images on google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "5530e436",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:13.439138Z",
"iopub.status.busy": "2023-10-20T12:21:13.437740Z",
"iopub.status.idle": "2023-10-20T12:21:13.505594Z",
"shell.execute_reply": "2023-10-20T12:21:13.504797Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reading data from df.\n"
]
},
{
"ename": "ValueError",
"evalue": "Please check your provided dataframe - no text_english text data found.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[11], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m input_file_path \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdata_out.csv\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m----> 2\u001b[0m topic_model, topic_df, most_frequent_topics \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPostprocessText\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_csv\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcsv_path\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minput_file_path\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39manalyse_topic(return_topics\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m10\u001b[39m)\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:307\u001b[0m, in \u001b[0;36mPostprocessText.__init__\u001b[0;34m(self, mydict, use_csv, csv_path, analyze_text)\u001b[0m\n\u001b[1;32m 305\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from df.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 306\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(csv_path, encoding\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mutf8\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m--> 307\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlist_text_english \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_df\u001b[49m\u001b[43m(\u001b[49m\u001b[43manalyze_text\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 308\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 309\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 310\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease provide either dictionary with textual data or \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 311\u001b[0m \u001b[38;5;124m a csv file by setting `use_csv` to True and providing a \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 312\u001b[0m \u001b[38;5;124m `csv_path`.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 313\u001b[0m )\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:397\u001b[0m, in \u001b[0;36mPostprocessText.get_text_df\u001b[0;34m(self, analyze_text)\u001b[0m\n\u001b[1;32m 394\u001b[0m \u001b[38;5;66;03m# use csv file to obtain dataframe and put text_english or text_summary in list\u001b[39;00m\n\u001b[1;32m 395\u001b[0m \u001b[38;5;66;03m# check that \"text_english\" or \"text_summary\" is there\u001b[39;00m\n\u001b[1;32m 396\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m analyze_text \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf:\n\u001b[0;32m--> 397\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 398\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease check your provided dataframe - \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 399\u001b[0m \u001b[38;5;124m no \u001b[39m\u001b[38;5;132;01m{}\u001b[39;00m\u001b[38;5;124m text data found.\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mformat(\n\u001b[1;32m 400\u001b[0m analyze_text\n\u001b[1;32m 401\u001b[0m )\n\u001b[1;32m 402\u001b[0m )\n\u001b[1;32m 403\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf[analyze_text]\u001b[38;5;241m.\u001b[39mtolist()\n",
"\u001b[0;31mValueError\u001b[0m: Please check your provided dataframe - no text_english text data found."
]
}
],
"source": [
"input_file_path = \"data_out.csv\"\n",
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
" use_csv=True, csv_path=input_file_path\n",
").analyse_topic(return_topics=10)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0b6ef6d7",
"metadata": {},
"source": [
"### Access frequent topics\n",
"A topic of `-1` stands for an outlier and should be ignored. Topic count is the number of occurence of that topic. The output is structured from most frequent to least frequent topic."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:13.509125Z",
"iopub.status.busy": "2023-10-20T12:21:13.508664Z",
"iopub.status.idle": "2023-10-20T12:21:13.551069Z",
"shell.execute_reply": "2023-10-20T12:21:13.550365Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'topic_df' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[12], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[43mtopic_df\u001b[49m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'topic_df' is not defined"
]
}
],
"source": [
"print(topic_df)"
]
},
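{
"attachments": {},
"cell_type": "markdown",
"id": "5d2e8b4a",
"metadata": {},
"source": [
"If you want to look at the genuine topics only, you can filter the outlier out. A minimal sketch, assuming `topic_df` carries the `Topic` column that BERTopic's `get_topic_info()` returns:\n",
"\n",
"```\n",
"# drop the outlier topic -1, keeping only genuine topics\n",
"print(topic_df[topic_df[\"Topic\"] != -1])\n",
"```"
]
},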
{
"attachments": {},
"cell_type": "markdown",
"id": "b3316770",
"metadata": {},
"source": [
"### Get information for specific topic\n",
"The most frequent topics can be accessed through `most_frequent_topics` with the most occuring topics first in the list."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "db14fe03",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:13.554591Z",
"iopub.status.busy": "2023-10-20T12:21:13.553971Z",
"iopub.status.idle": "2023-10-20T12:21:13.595171Z",
"shell.execute_reply": "2023-10-20T12:21:13.594404Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'most_frequent_topics' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[13], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m topic \u001b[38;5;129;01min\u001b[39;00m \u001b[43mmost_frequent_topics\u001b[49m:\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTopic:\u001b[39m\u001b[38;5;124m\"\u001b[39m, topic)\n",
"\u001b[0;31mNameError\u001b[0m: name 'most_frequent_topics' is not defined"
]
}
],
"source": [
"for topic in most_frequent_topics:\n",
" print(\"Topic:\", topic)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d10f701e",
"metadata": {},
"source": [
"### Topic visualization\n",
"The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "2331afe6",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:13.598899Z",
"iopub.status.busy": "2023-10-20T12:21:13.598199Z",
"iopub.status.idle": "2023-10-20T12:21:13.637533Z",
"shell.execute_reply": "2023-10-20T12:21:13.636790Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'topic_model' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[14], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtopic_model\u001b[49m\u001b[38;5;241m.\u001b[39mvisualize_topics()\n",
"\u001b[0;31mNameError\u001b[0m: name 'topic_model' is not defined"
]
}
],
"source": [
"topic_model.visualize_topics()"
]
},
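{
"attachments": {},
"cell_type": "markdown",
"id": "9e1c6f35",
"metadata": {},
"source": [
"`visualize_topics()` returns a plotly figure, so you can also keep the interactive plot. A minimal sketch, with a hypothetical output file name:\n",
"\n",
"```\n",
"# save the interactive topic plot as a standalone html file\n",
"fig = topic_model.visualize_topics()\n",
"fig.write_html(\"topics.html\")\n",
"```"
]
},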
{
"attachments": {},
"cell_type": "markdown",
"id": "f4eaf353",
"metadata": {},
"source": [
"### Save the model\n",
"The model can be saved for future use."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e5e8377c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:13.641364Z",
"iopub.status.busy": "2023-10-20T12:21:13.640744Z",
"iopub.status.idle": "2023-10-20T12:21:13.680230Z",
"shell.execute_reply": "2023-10-20T12:21:13.679510Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'topic_model' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[15], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtopic_model\u001b[49m\u001b[38;5;241m.\u001b[39msave(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmisinfo_posts\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'topic_model' is not defined"
]
}
],
"source": [
"topic_model.save(\"misinfo_posts\")"
]
},
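{
"attachments": {},
"cell_type": "markdown",
"id": "c2d4e6f8",
"metadata": {},
"source": [
"A saved model can be restored later with BERTopic's `load` method. A minimal sketch, assuming the model was saved under the name used above:\n",
"\n",
"```\n",
"from bertopic import BERTopic\n",
"\n",
"# restore the previously saved topic model\n",
"topic_model = BERTopic.load(\"misinfo_posts\")\n",
"```"
]
}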
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
},
"vscode": {
"interpreter": {
"hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}