зеркало из
				https://github.com/ssciwr/AMMICO.git
				synced 2025-11-04 07:56:04 +02:00 
			
		
		
		
	
		
			
				
	
	
		
			831 строка
		
	
	
		
			59 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			831 строка
		
	
	
		
			59 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
{
 | 
						|
 "cells": [
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "dcaa3da1",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "# Notebook for text extraction on image\n",
 | 
						|
    "\n",
 | 
						|
    "The text extraction and analysis is carried out using a variety of tools:  \n",
 | 
						|
    "\n",
 | 
						|
    "1. Text extraction from the image using [google-cloud-vision](https://cloud.google.com/vision)  \n",
 | 
						|
    "1. Language detection of the extracted text using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/)  \n",
 | 
						|
    "1. Translation into English or other languages using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
 | 
						|
    "1. Cleaning of the text using [spacy](https://spacy.io/) \n",
 | 
						|
    "1. Spell-check using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
 | 
						|
    "1. Subjectivity analysis using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
 | 
						|
    "1. Text summarization using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n",
 | 
						|
    "1. Sentiment analysis using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
 | 
						|
    "1. Named entity recognition using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
 | 
						|
    "1. Topic analysis using [BERTopic](https://github.com/MaartenGr/BERTopic) \n",
 | 
						|
    "\n",
 | 
						|
    "The first cell is only run on google colab and installs the [ammico](https://github.com/ssciwr/AMMICO) package.\n",
 | 
						|
    "\n",
 | 
						|
    "After that, we can import `ammico` and read in the files given a folder path."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 1,
 | 
						|
   "id": "f43f327c",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:01:29.946229Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:01:29.945931Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:01:29.956966Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:01:29.956190Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [],
 | 
						|
   "source": [
 | 
						|
    "# if running on google colab\n",
 | 
						|
    "# flake8-noqa-cell\n",
 | 
						|
    "import os\n",
 | 
						|
    "\n",
 | 
						|
    "if \"google.colab\" in str(get_ipython()):\n",
 | 
						|
    "    # update python version\n",
 | 
						|
    "    # install setuptools\n",
 | 
						|
    "    # %pip install setuptools==61 -qqq\n",
 | 
						|
    "    # install ammico\n",
 | 
						|
    "    %pip install git+https://github.com/ssciwr/ammico.git -qqq\n",
 | 
						|
    "    # mount google drive for data and API key\n",
 | 
						|
    "    from google.colab import drive\n",
 | 
						|
    "\n",
 | 
						|
    "    drive.mount(\"/content/drive\")"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 2,
 | 
						|
   "id": "cf362e60",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:01:29.962409Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:01:29.961500Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:01:53.133043Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:01:53.132124Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [],
 | 
						|
   "source": [
 | 
						|
    "import os\n",
 | 
						|
    "import ammico\n",
 | 
						|
    "from ammico import utils as mutils\n",
 | 
						|
    "from ammico import display as mdisplay"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "fddba721",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "We select a subset of image files to try the text extraction on, see the `limit` keyword. The `find_files` function finds image files within a given directory: "
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 3,
 | 
						|
   "id": "27675810",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:01:53.141156Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:01:53.139940Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:01:53.146660Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:01:53.145902Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [],
 | 
						|
   "source": [
 | 
						|
    "# Here you need to provide the path to your google drive folder\n",
 | 
						|
    "# or local folder containing the images\n",
 | 
						|
    "images = mutils.find_files(\n",
 | 
						|
    "    path=\"data/\",\n",
 | 
						|
    "    limit=10,\n",
 | 
						|
    ")"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "3a7dfe11",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 4,
 | 
						|
   "id": "8b32409f",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:01:53.152119Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:01:53.151395Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:01:53.156092Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:01:53.154809Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [],
 | 
						|
   "source": [
 | 
						|
    "mydict = mutils.initialize_dict(images)"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "7b8b929f",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "## Google cloud vision API\n",
 | 
						|
    "\n",
 | 
						|
    "For this you need an API key and have the app activated in your google console. The first 1000 images per month are free (July 2022)."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "```\n",
 | 
						|
    "os.environ[\n",
 | 
						|
    "    \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
 | 
						|
    "] = \"your-credentials.json\"\n",
 | 
						|
    "```"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "0891b795-c7fe-454c-a45d-45fadf788142",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "## Inspect the elements per image\n",
 | 
						|
    "To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below.\n",
 | 
						|
    "Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the `port` number if you are already running several notebook instances on the same server."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 5,
 | 
						|
   "id": "7c6ecc88",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:01:53.160497Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:01:53.159837Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:01:54.534252Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:01:54.533419Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "ename": "TypeError",
 | 
						|
     "evalue": "__init__() got an unexpected keyword argument 'identify'",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[5], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m analysis_explorer \u001b[38;5;241m=\u001b[39m \u001b[43mmdisplay\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mAnalysisExplorer\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43midentify\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mtext-on-image\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m      2\u001b[0m analysis_explorer\u001b[38;5;241m.\u001b[39mrun_server(port\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m8054\u001b[39m)\n",
 | 
						|
      "\u001b[0;31mTypeError\u001b[0m: __init__() got an unexpected keyword argument 'identify'"
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "analysis_explorer = mdisplay.AnalysisExplorer(mydict, identify=\"text-on-image\")\n",
 | 
						|
    "analysis_explorer.run_server(port=8054)"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "## Or directly analyze for further processing\n",
 | 
						|
    "Instead of inspecting each of the images, you can also directly carry out the analysis and export the result into a csv. This may take a while depending on how many images you have loaded. Set the keyword `analyse_text` to `True` if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER)."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 6,
 | 
						|
   "id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:01:54.539872Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:01:54.539251Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:02:03.765707Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:02:03.764609Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "Collecting en-core-web-md==3.6.0\n"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.6.0/en_core_web_md-3.6.0-py3-none-any.whl (42.8 MB)\n",
 | 
						|
      "\u001b[?25l     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/42.8 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.1/42.8 MB\u001b[0m \u001b[31m2.1 MB/s\u001b[0m eta \u001b[36m0:00:21\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.2/42.8 MB\u001b[0m \u001b[31m3.4 MB/s\u001b[0m eta \u001b[36m0:00:13\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.0/42.8 MB\u001b[0m \u001b[31m38.0 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.3/42.8 MB\u001b[0m \u001b[31m39.6 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.3/42.8 MB\u001b[0m \u001b[31m39.6 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.3/42.8 MB\u001b[0m \u001b[31m39.6 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.3/42.8 MB\u001b[0m \u001b[31m39.6 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.3/42.8 MB\u001b[0m \u001b[31m16.2 MB/s\u001b[0m eta \u001b[36m0:00:03\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.3/42.8 MB\u001b[0m \u001b[31m16.2 MB/s\u001b[0m eta \u001b[36m0:00:03\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.8/42.8 MB\u001b[0m \u001b[31m16.2 MB/s\u001b[0m eta \u001b[36m0:00:03\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/42.8 MB\u001b[0m \u001b[31m19.9 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m9.4/42.8 MB\u001b[0m \u001b[31m21.8 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.6/42.8 MB\u001b[0m \u001b[31m27.1 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/42.8 MB\u001b[0m \u001b[31m62.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m17.3/42.8 MB\u001b[0m \u001b[31m62.6 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m19.4/42.8 MB\u001b[0m \u001b[31m62.7 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.9/42.8 MB\u001b[0m \u001b[31m65.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m24.1/42.8 MB\u001b[0m \u001b[31m62.5 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━\u001b[0m \u001b[32m26.5/42.8 MB\u001b[0m \u001b[31m62.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━\u001b[0m \u001b[32m28.7/42.8 MB\u001b[0m \u001b[31m62.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━\u001b[0m \u001b[32m31.0/42.8 MB\u001b[0m \u001b[31m66.2 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━\u001b[0m \u001b[32m33.3/42.8 MB\u001b[0m \u001b[31m64.7 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━\u001b[0m \u001b[32m35.0/42.8 MB\u001b[0m \u001b[31m59.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━\u001b[0m \u001b[32m37.9/42.8 MB\u001b[0m \u001b[31m65.3 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━\u001b[0m \u001b[32m39.8/42.8 MB\u001b[0m \u001b[31m61.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.4/42.8 MB\u001b[0m \u001b[31m63.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m59.6 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m59.6 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m59.6 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "\r",
 | 
						|
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m59.6 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
 | 
						|
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m30.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
 | 
						|
      "\u001b[?25hRequirement already satisfied: spacy<3.7.0,>=3.6.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from en-core-web-md==3.6.0) (3.6.1)\n"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (3.0.12)\n",
 | 
						|
      "Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (1.0.5)\n",
 | 
						|
      "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (1.0.10)\n",
 | 
						|
      "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (2.0.8)\n",
 | 
						|
      "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (3.0.9)\n",
 | 
						|
      "Requirement already satisfied: thinc<8.2.0,>=8.1.8 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (8.1.12)\n",
 | 
						|
      "Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (1.1.2)\n",
 | 
						|
      "Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (2.4.7)\n",
 | 
						|
      "Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (2.0.9)\n",
 | 
						|
      "Requirement already satisfied: typer<0.10.0,>=0.3.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (0.9.0)\n",
 | 
						|
      "Requirement already satisfied: pathy>=0.10.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (0.10.2)\n",
 | 
						|
      "Requirement already satisfied: smart-open<7.0.0,>=5.2.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (6.4.0)\n",
 | 
						|
      "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (4.66.1)\n",
 | 
						|
      "Requirement already satisfied: numpy>=1.15.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (1.23.4)\n",
 | 
						|
      "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (2.31.0)\n",
 | 
						|
      "Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (1.10.12)\n",
 | 
						|
      "Requirement already satisfied: jinja2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (3.1.2)\n",
 | 
						|
      "Requirement already satisfied: setuptools in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (58.1.0)\n",
 | 
						|
      "Requirement already satisfied: packaging>=20.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (23.1)\n",
 | 
						|
      "Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (3.3.0)\n",
 | 
						|
      "Requirement already satisfied: typing-extensions>=4.2.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (4.5.0)\n",
 | 
						|
      "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (3.2.0)\n",
 | 
						|
      "Requirement already satisfied: idna<4,>=2.5 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (2.10)\n",
 | 
						|
      "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (1.26.16)\n",
 | 
						|
      "Requirement already satisfied: certifi>=2017.4.17 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (2023.7.22)\n",
 | 
						|
      "Requirement already satisfied: blis<0.8.0,>=0.7.8 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from thinc<8.2.0,>=8.1.8->spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (0.7.10)\n",
 | 
						|
      "Requirement already satisfied: confection<1.0.0,>=0.0.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from thinc<8.2.0,>=8.1.8->spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (0.1.3)\n"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "Requirement already satisfied: click<9.0.0,>=7.1.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from typer<0.10.0,>=0.3.0->spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (8.1.7)\n",
 | 
						|
      "Requirement already satisfied: MarkupSafe>=2.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from jinja2->spacy<3.7.0,>=3.6.0->en-core-web-md==3.6.0) (2.1.3)\n"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "Installing collected packages: en-core-web-md\n"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "Successfully installed en-core-web-md-3.6.0\n"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stderr",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "\n",
 | 
						|
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.2.1\u001b[0m\n",
 | 
						|
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
 | 
						|
      "You can now load the package via spacy.load('en_core_web_md')\n"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "ename": "FileNotFoundError",
 | 
						|
     "evalue": "[Errno 2] No such file or directory: '102141_2_eng'",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mFileNotFoundError\u001b[0m                         Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[6], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m key \u001b[38;5;129;01min\u001b[39;00m mydict:\n\u001b[0;32m----> 2\u001b[0m     mydict[key] \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mTextDetector\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m      3\u001b[0m \u001b[43m        \u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m[\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43manalyse_text\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\n\u001b[1;32m      4\u001b[0m \u001b[43m    \u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43manalyse_image\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
 | 
						|
      "File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:158\u001b[0m, in \u001b[0;36mTextDetector.analyse_image\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m    152\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21manalyse_image\u001b[39m(\u001b[38;5;28mself\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mdict\u001b[39m:\n\u001b[1;32m    153\u001b[0m \u001b[38;5;250m    \u001b[39m\u001b[38;5;124;03m\"\"\"Perform text extraction and analysis of the text.\u001b[39;00m\n\u001b[1;32m    154\u001b[0m \n\u001b[1;32m    155\u001b[0m \u001b[38;5;124;03m    Returns:\u001b[39;00m\n\u001b[1;32m    156\u001b[0m \u001b[38;5;124;03m        dict: The updated dictionary with text analysis results.\u001b[39;00m\n\u001b[1;32m    157\u001b[0m \u001b[38;5;124;03m    \"\"\"\u001b[39;00m\n\u001b[0;32m--> 158\u001b[0m     \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_from_image\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m    159\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtranslate_text()\n\u001b[1;32m    160\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mremove_linebreaks()\n",
 | 
						|
      "File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:178\u001b[0m, in \u001b[0;36mTextDetector.get_text_from_image\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m    174\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m DefaultCredentialsError:\n\u001b[1;32m    175\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m DefaultCredentialsError(\n\u001b[1;32m    176\u001b[0m         \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease provide credentials for google cloud vision API, see https://cloud.google.com/docs/authentication/application-default-credentials.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    177\u001b[0m     )\n\u001b[0;32m--> 178\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[43mio\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mopen\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mrb\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m \u001b[38;5;28;01mas\u001b[39;00m image_file:\n\u001b[1;32m    179\u001b[0m     content \u001b[38;5;241m=\u001b[39m image_file\u001b[38;5;241m.\u001b[39mread()\n\u001b[1;32m    180\u001b[0m image \u001b[38;5;241m=\u001b[39m vision\u001b[38;5;241m.\u001b[39mImage(content\u001b[38;5;241m=\u001b[39mcontent)\n",
 | 
						|
      "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '102141_2_eng'"
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "for key in mydict:\n",
 | 
						|
    "    mydict[key] = ammico.text.TextDetector(\n",
 | 
						|
    "        mydict[key], analyse_text=True\n",
 | 
						|
    "    ).analyse_image()"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "3c063eda",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "## Convert to dataframe and write csv\n",
 | 
						|
    "These steps are required to convert the dictionary of dictionarys into a dictionary with lists, that can be converted into a pandas dataframe and exported to a csv file."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 7,
 | 
						|
   "id": "5709c2cd",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:02:03.770904Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:02:03.770195Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:02:04.639445Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:02:04.638504Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "ename": "ValueError",
 | 
						|
     "evalue": "All arrays must be of the same length",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[7], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m outdict \u001b[38;5;241m=\u001b[39m mutils\u001b[38;5;241m.\u001b[39mappend_data_to_dict(mydict)\n\u001b[0;32m----> 2\u001b[0m df \u001b[38;5;241m=\u001b[39m \u001b[43mmutils\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdump_df\u001b[49m\u001b[43m(\u001b[49m\u001b[43moutdict\u001b[49m\u001b[43m)\u001b[49m\n",
 | 
						|
      "File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/utils.py:222\u001b[0m, in \u001b[0;36mdump_df\u001b[0;34m(mydict)\u001b[0m\n\u001b[1;32m    220\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mdump_df\u001b[39m(mydict: \u001b[38;5;28mdict\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m DataFrame:\n\u001b[1;32m    221\u001b[0m \u001b[38;5;250m    \u001b[39m\u001b[38;5;124;03m\"\"\"Utility to dump the dictionary into a dataframe.\"\"\"\u001b[39;00m\n\u001b[0;32m--> 222\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mDataFrame\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m)\u001b[49m\n",
 | 
						|
      "File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/frame.py:1816\u001b[0m, in \u001b[0;36mDataFrame.from_dict\u001b[0;34m(cls, data, orient, dtype, columns)\u001b[0m\n\u001b[1;32m   1810\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m   1811\u001b[0m         \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mExpected \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m, \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m or \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mtight\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m for orient parameter. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m   1812\u001b[0m         \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mGot \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00morient\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m instead\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m   1813\u001b[0m     )\n\u001b[1;32m   1815\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m orient \u001b[38;5;241m!=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtight\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[0;32m-> 1816\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mcls\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m   1817\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m   1818\u001b[0m     realdata \u001b[38;5;241m=\u001b[39m data[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdata\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n",
 | 
						|
      "File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/frame.py:736\u001b[0m, in \u001b[0;36mDataFrame.__init__\u001b[0;34m(self, data, index, columns, dtype, copy)\u001b[0m\n\u001b[1;32m    730\u001b[0m     mgr \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_init_mgr(\n\u001b[1;32m    731\u001b[0m         data, axes\u001b[38;5;241m=\u001b[39m{\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m\"\u001b[39m: index, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m\"\u001b[39m: columns}, dtype\u001b[38;5;241m=\u001b[39mdtype, copy\u001b[38;5;241m=\u001b[39mcopy\n\u001b[1;32m    732\u001b[0m     )\n\u001b[1;32m    734\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, \u001b[38;5;28mdict\u001b[39m):\n\u001b[1;32m    735\u001b[0m     \u001b[38;5;66;03m# GH#38939 de facto copy defaults to False only in non-dict cases\u001b[39;00m\n\u001b[0;32m--> 736\u001b[0m     mgr \u001b[38;5;241m=\u001b[39m \u001b[43mdict_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmanager\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m    737\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, ma\u001b[38;5;241m.\u001b[39mMaskedArray):\n\u001b[1;32m    738\u001b[0m     \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mnumpy\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mma\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m mrecords\n",
 | 
						|
      "File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:503\u001b[0m, in \u001b[0;36mdict_to_mgr\u001b[0;34m(data, index, columns, dtype, typ, copy)\u001b[0m\n\u001b[1;32m    499\u001b[0m     \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m    500\u001b[0m         \u001b[38;5;66;03m# dtype check to exclude e.g. range objects, scalars\u001b[39;00m\n\u001b[1;32m    501\u001b[0m         arrays \u001b[38;5;241m=\u001b[39m [x\u001b[38;5;241m.\u001b[39mcopy() \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(x, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdtype\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m x \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m arrays]\n\u001b[0;32m--> 503\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43marrays_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtyp\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mconsolidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m)\u001b[49m\n",
 | 
						|
      "File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:114\u001b[0m, in \u001b[0;36marrays_to_mgr\u001b[0;34m(arrays, columns, index, dtype, verify_integrity, typ, consolidate)\u001b[0m\n\u001b[1;32m    111\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m verify_integrity:\n\u001b[1;32m    112\u001b[0m     \u001b[38;5;66;03m# figure out the index, if necessary\u001b[39;00m\n\u001b[1;32m    113\u001b[0m     \u001b[38;5;28;01mif\u001b[39;00m index \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m--> 114\u001b[0m         index \u001b[38;5;241m=\u001b[39m \u001b[43m_extract_index\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m    115\u001b[0m     \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m    116\u001b[0m         index \u001b[38;5;241m=\u001b[39m ensure_index(index)\n",
 | 
						|
      "File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:677\u001b[0m, in \u001b[0;36m_extract_index\u001b[0;34m(data)\u001b[0m\n\u001b[1;32m    675\u001b[0m lengths \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mset\u001b[39m(raw_lengths))\n\u001b[1;32m    676\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(lengths) \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[0;32m--> 677\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAll arrays must be of the same length\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m    679\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m have_dicts:\n\u001b[1;32m    680\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m    681\u001b[0m         \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMixing dicts with non-Series may lead to ambiguous ordering.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    682\u001b[0m     )\n",
 | 
						|
      "\u001b[0;31mValueError\u001b[0m: All arrays must be of the same length"
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "outdict = mutils.append_data_to_dict(mydict)\n",
 | 
						|
    "df = mutils.dump_df(outdict)"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "ae182eb7",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "Check the dataframe:"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 8,
 | 
						|
   "id": "c4f05637",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:02:04.645629Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:02:04.645034Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:02:04.696080Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:02:04.695096Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "ename": "NameError",
 | 
						|
     "evalue": "name 'df' is not defined",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[8], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mdf\u001b[49m\u001b[38;5;241m.\u001b[39mhead(\u001b[38;5;241m10\u001b[39m)\n",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m: name 'df' is not defined"
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "df.head(10)"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "eedf1e47",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "Write the csv file - here you should provide a file path and file name for the csv file to be written."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 9,
 | 
						|
   "id": "bf6c9ddb",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:02:04.700162Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:02:04.699419Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:02:04.752126Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:02:04.751103Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "ename": "NameError",
 | 
						|
     "evalue": "name 'df' is not defined",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[9], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[38;5;66;03m# Write the csv\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mdf\u001b[49m\u001b[38;5;241m.\u001b[39mto_csv(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m./data_out.csv\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m: name 'df' is not defined"
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "# Write the csv\n",
 | 
						|
    "df.to_csv(\"./data_out.csv\")"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "4bc8ac0a",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "## Topic analysis\n",
 | 
						|
    "The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) using an embedded model through a [spaCy](https://spacy.io/) pipeline."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "4931941b",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If the below returns an error for `analyse_topic()`, the reason can be that your dataset is too small.\n",
 | 
						|
    "\n",
 | 
						|
    "You can pass which dataframe entry you would like to have analyzed. The default is `text_english`, but you could for example also select `text_summary` or `text_english_correct` setting the keyword `analyze_text` as so:\n",
 | 
						|
    "\n",
 | 
						|
    "`ammico.text.PostprocessText(mydict=mydict, analyze_text=\"text_summary\").analyse_topic()`\n",
 | 
						|
    "\n",
 | 
						|
    "### Option 1: Use the dictionary as obtained from the above analysis."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 10,
 | 
						|
   "id": "a3450a61",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:02:04.756944Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:02:04.756221Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:02:04.826491Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:02:04.825575Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "Reading data from dict.\n"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "ename": "ValueError",
 | 
						|
     "evalue": "Please check your provided dictionary -                 no text_english text data found.",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[10], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[38;5;66;03m# make a list of all the text_english entries per analysed image from the mydict variable as above\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m topic_model, topic_df, most_frequent_topics \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPostprocessText\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m      3\u001b[0m \u001b[43m    \u001b[49m\u001b[43mmydict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmydict\u001b[49m\n\u001b[1;32m      4\u001b[0m \u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39manalyse_topic()\n",
 | 
						|
      "File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:303\u001b[0m, in \u001b[0;36mPostprocessText.__init__\u001b[0;34m(self, mydict, use_csv, csv_path, analyze_text)\u001b[0m\n\u001b[1;32m    301\u001b[0m     \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from dict.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m    302\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict \u001b[38;5;241m=\u001b[39m mydict\n\u001b[0;32m--> 303\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlist_text_english \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43manalyze_text\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m    304\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39muse_csv:\n\u001b[1;32m    305\u001b[0m     \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from df.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
 | 
						|
      "File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:375\u001b[0m, in \u001b[0;36mPostprocessText.get_text_dict\u001b[0;34m(self, analyze_text)\u001b[0m\n\u001b[1;32m    373\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m key \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict\u001b[38;5;241m.\u001b[39mkeys():\n\u001b[1;32m    374\u001b[0m     \u001b[38;5;28;01mif\u001b[39;00m analyze_text \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict[key]:\n\u001b[0;32m--> 375\u001b[0m         \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m    376\u001b[0m             \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease check your provided dictionary - \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m    377\u001b[0m \u001b[38;5;124m        no \u001b[39m\u001b[38;5;132;01m{}\u001b[39;00m\u001b[38;5;124m text data found.\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mformat(\n\u001b[1;32m    378\u001b[0m                 analyze_text\n\u001b[1;32m    379\u001b[0m             )\n\u001b[1;32m    380\u001b[0m         )\n\u001b[1;32m    381\u001b[0m     list_text_english\u001b[38;5;241m.\u001b[39mappend(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict[key][analyze_text])\n\u001b[1;32m    382\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m list_text_english\n",
 | 
						|
      "\u001b[0;31mValueError\u001b[0m: Please check your provided dictionary -                 no text_english text data found."
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
 | 
						|
    "topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
 | 
						|
    "    mydict=mydict\n",
 | 
						|
    ").analyse_topic()"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "95667342",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "### Option 2: Read in a csv\n",
 | 
						|
    "Not to analyse too many images on google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 11,
 | 
						|
   "id": "5530e436",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:02:04.830533Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:02:04.829989Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:02:04.904568Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:02:04.903442Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "name": "stdout",
 | 
						|
     "output_type": "stream",
 | 
						|
     "text": [
 | 
						|
      "Reading data from df.\n"
 | 
						|
     ]
 | 
						|
    },
 | 
						|
    {
 | 
						|
     "ename": "ValueError",
 | 
						|
     "evalue": "Please check your provided dataframe -                                 no text_english text data found.",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[11], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m input_file_path \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdata_out.csv\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m----> 2\u001b[0m topic_model, topic_df, most_frequent_topics \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPostprocessText\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m      3\u001b[0m \u001b[43m    \u001b[49m\u001b[43muse_csv\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcsv_path\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minput_file_path\u001b[49m\n\u001b[1;32m      4\u001b[0m \u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39manalyse_topic(return_topics\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m10\u001b[39m)\n",
 | 
						|
      "File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:307\u001b[0m, in \u001b[0;36mPostprocessText.__init__\u001b[0;34m(self, mydict, use_csv, csv_path, analyze_text)\u001b[0m\n\u001b[1;32m    305\u001b[0m     \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from df.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m    306\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(csv_path, encoding\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mutf8\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m--> 307\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlist_text_english \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_df\u001b[49m\u001b[43m(\u001b[49m\u001b[43manalyze_text\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m    308\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m    309\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m    310\u001b[0m         \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease provide either dictionary with textual data or \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m    311\u001b[0m \u001b[38;5;124m                      a csv file by setting `use_csv` to True and providing a \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m    312\u001b[0m \u001b[38;5;124m                     `csv_path`.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    313\u001b[0m     )\n",
 | 
						|
      "File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:397\u001b[0m, in \u001b[0;36mPostprocessText.get_text_df\u001b[0;34m(self, analyze_text)\u001b[0m\n\u001b[1;32m    394\u001b[0m \u001b[38;5;66;03m# use csv file to obtain dataframe and put text_english or text_summary in list\u001b[39;00m\n\u001b[1;32m    395\u001b[0m \u001b[38;5;66;03m# check that \"text_english\" or \"text_summary\" is there\u001b[39;00m\n\u001b[1;32m    396\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m analyze_text \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf:\n\u001b[0;32m--> 397\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m    398\u001b[0m         \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease check your provided dataframe - \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m    399\u001b[0m \u001b[38;5;124m                        no \u001b[39m\u001b[38;5;132;01m{}\u001b[39;00m\u001b[38;5;124m text data found.\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mformat(\n\u001b[1;32m    400\u001b[0m             analyze_text\n\u001b[1;32m    401\u001b[0m         )\n\u001b[1;32m    402\u001b[0m     )\n\u001b[1;32m    403\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf[analyze_text]\u001b[38;5;241m.\u001b[39mtolist()\n",
 | 
						|
      "\u001b[0;31mValueError\u001b[0m: Please check your provided dataframe -                                 no text_english text data found."
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "input_file_path = \"data_out.csv\"\n",
 | 
						|
    "topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
 | 
						|
    "    use_csv=True, csv_path=input_file_path\n",
 | 
						|
    ").analyse_topic(return_topics=10)"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "0b6ef6d7",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "### Access frequent topics\n",
 | 
						|
    "A topic of `-1` stands for an outlier and should be ignored. Topic count is the number of occurence of that topic. The output is structured from most frequent to least frequent topic."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 12,
 | 
						|
   "id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:02:04.908947Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:02:04.908392Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:02:04.966053Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:02:04.964776Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "ename": "NameError",
 | 
						|
     "evalue": "name 'topic_df' is not defined",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[12], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[43mtopic_df\u001b[49m)\n",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m: name 'topic_df' is not defined"
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "print(topic_df)"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "b3316770",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "### Get information for specific topic\n",
 | 
						|
    "The most frequent topics can be accessed through `most_frequent_topics` with the most occuring topics first in the list."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 13,
 | 
						|
   "id": "db14fe03",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:02:04.970804Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:02:04.969999Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:02:05.021679Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:02:05.020767Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "ename": "NameError",
 | 
						|
     "evalue": "name 'most_frequent_topics' is not defined",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[13], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m topic \u001b[38;5;129;01min\u001b[39;00m \u001b[43mmost_frequent_topics\u001b[49m:\n\u001b[1;32m      2\u001b[0m     \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTopic:\u001b[39m\u001b[38;5;124m\"\u001b[39m, topic)\n",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m: name 'most_frequent_topics' is not defined"
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "for topic in most_frequent_topics:\n",
 | 
						|
    "    print(\"Topic:\", topic)"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "d10f701e",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "### Topic visualization\n",
 | 
						|
    "The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 14,
 | 
						|
   "id": "2331afe6",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:02:05.026093Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:02:05.025196Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:02:05.073038Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:02:05.072205Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "ename": "NameError",
 | 
						|
     "evalue": "name 'topic_model' is not defined",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[14], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtopic_model\u001b[49m\u001b[38;5;241m.\u001b[39mvisualize_topics()\n",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m: name 'topic_model' is not defined"
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "topic_model.visualize_topics()"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "attachments": {},
 | 
						|
   "cell_type": "markdown",
 | 
						|
   "id": "f4eaf353",
 | 
						|
   "metadata": {},
 | 
						|
   "source": [
 | 
						|
    "### Save the model\n",
 | 
						|
    "The model can be saved for future use."
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": 15,
 | 
						|
   "id": "e5e8377c",
 | 
						|
   "metadata": {
 | 
						|
    "execution": {
 | 
						|
     "iopub.execute_input": "2023-09-18T12:02:05.077295Z",
 | 
						|
     "iopub.status.busy": "2023-09-18T12:02:05.076561Z",
 | 
						|
     "iopub.status.idle": "2023-09-18T12:02:05.123307Z",
 | 
						|
     "shell.execute_reply": "2023-09-18T12:02:05.122349Z"
 | 
						|
    }
 | 
						|
   },
 | 
						|
   "outputs": [
 | 
						|
    {
 | 
						|
     "ename": "NameError",
 | 
						|
     "evalue": "name 'topic_model' is not defined",
 | 
						|
     "output_type": "error",
 | 
						|
     "traceback": [
 | 
						|
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
 | 
						|
      "Cell \u001b[0;32mIn[15], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtopic_model\u001b[49m\u001b[38;5;241m.\u001b[39msave(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmisinfo_posts\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
 | 
						|
      "\u001b[0;31mNameError\u001b[0m: name 'topic_model' is not defined"
 | 
						|
     ]
 | 
						|
    }
 | 
						|
   ],
 | 
						|
   "source": [
 | 
						|
    "topic_model.save(\"misinfo_posts\")"
 | 
						|
   ]
 | 
						|
  },
 | 
						|
  {
 | 
						|
   "cell_type": "code",
 | 
						|
   "execution_count": null,
 | 
						|
   "id": "7c94edb9",
 | 
						|
   "metadata": {},
 | 
						|
   "outputs": [],
 | 
						|
   "source": []
 | 
						|
  }
 | 
						|
 ],
 | 
						|
 "metadata": {
 | 
						|
  "kernelspec": {
 | 
						|
   "display_name": "Python 3 (ipykernel)",
 | 
						|
   "language": "python",
 | 
						|
   "name": "python3"
 | 
						|
  },
 | 
						|
  "language_info": {
 | 
						|
   "codemirror_mode": {
 | 
						|
    "name": "ipython",
 | 
						|
    "version": 3
 | 
						|
   },
 | 
						|
   "file_extension": ".py",
 | 
						|
   "mimetype": "text/x-python",
 | 
						|
   "name": "python",
 | 
						|
   "nbconvert_exporter": "python",
 | 
						|
   "pygments_lexer": "ipython3",
 | 
						|
   "version": "3.9.18"
 | 
						|
  },
 | 
						|
  "vscode": {
 | 
						|
   "interpreter": {
 | 
						|
    "hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
 | 
						|
   }
 | 
						|
  }
 | 
						|
 },
 | 
						|
 "nbformat": 4,
 | 
						|
 "nbformat_minor": 5
 | 
						|
}
 |