AMMICO/build/html/notebooks/Example text.ipynb

1092 строки
92 KiB
Plaintext

{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "dcaa3da1",
"metadata": {},
"source": [
"# Notebook for text extraction on image\n",
"\n",
"The text extraction and analysis is carried out using a variety of tools: \n",
"\n",
"1. Text extraction from the image using [google-cloud-vision](https://cloud.google.com/vision) \n",
"1. Language detection of the extracted text using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
"1. Translation into English or other languages using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
"1. Cleaning of the text using [spacy](https://spacy.io/) \n",
"1. Spell-check using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
"1. Subjectivity analysis using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
"1. Text summarization using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n",
"1. Sentiment analysis using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
"1. Named entity recognition using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
"1. Topic analysis using [BERTopic](https://github.com/MaartenGr/BERTopic) \n",
"\n",
"The first cell is only run on google colab and installs the [ammico](https://github.com/ssciwr/AMMICO) package.\n",
"\n",
"After that, we can import `ammico` and read in the files given a folder path."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f43f327c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:11.062661Z",
"iopub.status.busy": "2023-06-27T07:38:11.062168Z",
"iopub.status.idle": "2023-06-27T07:38:11.072737Z",
"shell.execute_reply": "2023-06-27T07:38:11.071232Z"
}
},
"outputs": [],
"source": [
"# if running on google colab\n",
"# flake8-noqa-cell\n",
"import os\n",
"\n",
"if \"google.colab\" in str(get_ipython()):\n",
" # update python version\n",
" # install setuptools\n",
" # %pip install setuptools==61 -qqq\n",
" # install ammico\n",
" %pip install git+https://github.com/ssciwr/ammico.git -qqq\n",
" # mount google drive for data and API key\n",
" from google.colab import drive\n",
"\n",
" drive.mount(\"/content/drive\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cf362e60",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:11.075881Z",
"iopub.status.busy": "2023-06-27T07:38:11.075451Z",
"iopub.status.idle": "2023-06-27T07:38:26.395782Z",
"shell.execute_reply": "2023-06-27T07:38:26.395084Z"
}
},
"outputs": [],
"source": [
"import os\n",
"import ammico\n",
"from ammico import utils as mutils\n",
"from ammico import display as mdisplay"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "fddba721",
"metadata": {},
"source": [
"We select a subset of image files to try the text extraction on, see the `limit` keyword. The `find_files` function finds image files within a given directory: "
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "27675810",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:26.399681Z",
"iopub.status.busy": "2023-06-27T07:38:26.398608Z",
"iopub.status.idle": "2023-06-27T07:38:26.403077Z",
"shell.execute_reply": "2023-06-27T07:38:26.402407Z"
}
},
"outputs": [],
"source": [
"# Here you need to provide the path to your google drive folder\n",
"# or local folder containing the images\n",
"images = mutils.find_files(\n",
" path=\"data/\",\n",
" limit=10,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3a7dfe11",
"metadata": {},
"source": [
"We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8b32409f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:26.406356Z",
"iopub.status.busy": "2023-06-27T07:38:26.405914Z",
"iopub.status.idle": "2023-06-27T07:38:26.409192Z",
"shell.execute_reply": "2023-06-27T07:38:26.408494Z"
}
},
"outputs": [],
"source": [
"mydict = mutils.initialize_dict(images)"
]
},
{
"cell_type": "markdown",
"id": "7b8b929f",
"metadata": {},
"source": [
"## Google cloud vision API\n",
"\n",
"For this you need an API key and have the app activated in your google console. The first 1000 images per month are free (July 2022)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986",
"metadata": {},
"source": [
"```\n",
"os.environ[\n",
" \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
"] = \"your-credentials.json\"\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0891b795-c7fe-454c-a45d-45fadf788142",
"metadata": {},
"source": [
"## Inspect the elements per image\n",
"To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below.\n",
"Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the `port` number if you are already running several notebook instances on the same server."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7c6ecc88",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:26.412604Z",
"iopub.status.busy": "2023-06-27T07:38:26.412166Z",
"iopub.status.idle": "2023-06-27T07:38:27.235923Z",
"shell.execute_reply": "2023-06-27T07:38:27.235113Z"
}
},
"outputs": [
{
"ename": "TypeError",
"evalue": "__init__() got an unexpected keyword argument 'identify'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[5], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m analysis_explorer \u001b[38;5;241m=\u001b[39m \u001b[43mmdisplay\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mAnalysisExplorer\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43midentify\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mtext-on-image\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2\u001b[0m analysis_explorer\u001b[38;5;241m.\u001b[39mrun_server(port\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m8054\u001b[39m)\n",
"\u001b[0;31mTypeError\u001b[0m: __init__() got an unexpected keyword argument 'identify'"
]
}
],
"source": [
"analysis_explorer = mdisplay.AnalysisExplorer(mydict, identify=\"text-on-image\")\n",
"analysis_explorer.run_server(port=8054)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52",
"metadata": {},
"source": [
"## Or directly analyze for further processing\n",
"Instead of inspecting each of the images, you can also directly carry out the analysis and export the result into a csv. This may take a while depending on how many images you have loaded. Set the keyword `analyse_text` to `True` if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:27.239844Z",
"iopub.status.busy": "2023-06-27T07:38:27.239260Z",
"iopub.status.idle": "2023-06-27T07:38:56.432379Z",
"shell.execute_reply": "2023-06-27T07:38:56.429232Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading (…)/a4f8f3e/config.json: 0%| | 0.00/1.80k [00:00<?, ?B/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading (…)/a4f8f3e/config.json: 100%|██████████| 1.80k/1.80k [00:00<00:00, 494kB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 0%| | 0.00/1.22G [00:00<?, ?B/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 2%|▏ | 21.0M/1.22G [00:00<00:06, 176MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 4%|▍ | 52.4M/1.22G [00:00<00:05, 227MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 7%|▋ | 83.9M/1.22G [00:00<00:04, 244MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 9%|▉ | 115M/1.22G [00:00<00:04, 251MB/s] "
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 12%|█▏ | 147M/1.22G [00:00<00:04, 251MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 15%|█▍ | 178M/1.22G [00:00<00:04, 252MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 17%|█▋ | 210M/1.22G [00:00<00:03, 256MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 20%|█▉ | 241M/1.22G [00:00<00:03, 259MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 22%|██▏ | 273M/1.22G [00:01<00:03, 259MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 25%|██▍ | 304M/1.22G [00:01<00:03, 261MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 27%|██▋ | 336M/1.22G [00:01<00:03, 259MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 30%|███ | 367M/1.22G [00:01<00:03, 262MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 33%|███▎ | 398M/1.22G [00:01<00:03, 262MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 35%|███▌ | 430M/1.22G [00:01<00:03, 263MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 38%|███▊ | 461M/1.22G [00:01<00:02, 264MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 40%|████ | 493M/1.22G [00:01<00:02, 264MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 43%|████▎ | 524M/1.22G [00:02<00:02, 261MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 45%|████▌ | 556M/1.22G [00:02<00:02, 259MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 48%|████▊ | 587M/1.22G [00:02<00:02, 255MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 51%|█████ | 619M/1.22G [00:02<00:02, 258MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 53%|█████▎ | 650M/1.22G [00:02<00:02, 257MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 56%|█████▌ | 682M/1.22G [00:02<00:02, 258MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 58%|█████▊ | 713M/1.22G [00:02<00:02, 202MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 61%|██████ | 744M/1.22G [00:03<00:02, 214MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 63%|██████▎ | 776M/1.22G [00:03<00:02, 212MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 66%|██████▌ | 807M/1.22G [00:03<00:01, 222MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 69%|██████▊ | 839M/1.22G [00:03<00:01, 210MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 71%|███████ | 870M/1.22G [00:03<00:01, 223MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 74%|███████▍ | 902M/1.22G [00:03<00:01, 213MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 76%|███████▋ | 933M/1.22G [00:03<00:01, 223MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 79%|███████▉ | 965M/1.22G [00:04<00:01, 195MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 81%|████████▏ | 996M/1.22G [00:04<00:01, 209MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 84%|████████▍ | 1.03G/1.22G [00:04<00:00, 204MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 87%|████████▋ | 1.06G/1.22G [00:04<00:00, 219MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 89%|████████▉ | 1.09G/1.22G [00:04<00:00, 203MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 92%|█████████▏| 1.12G/1.22G [00:04<00:00, 214MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 94%|█████████▍| 1.15G/1.22G [00:04<00:00, 206MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 97%|█████████▋| 1.18G/1.22G [00:05<00:00, 124MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 100%|█████████▉| 1.22G/1.22G [00:05<00:00, 147MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading pytorch_model.bin: 100%|██████████| 1.22G/1.22G [00:05<00:00, 218MB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading (…)okenizer_config.json: 0%| | 0.00/26.0 [00:00<?, ?B/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Downloading (…)okenizer_config.json: 100%|██████████| 26.0/26.0 [00:00<00:00, 13.6kB/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"ename": "ReadTimeout",
"evalue": "HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10.0)",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mtimeout\u001b[0m Traceback (most recent call last)",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/urllib3/connectionpool.py:466\u001b[0m, in \u001b[0;36mHTTPConnectionPool._make_request\u001b[0;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[1;32m 462\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mBaseException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[1;32m 463\u001b[0m \u001b[38;5;66;03m# Remove the TypeError from the exception chain in\u001b[39;00m\n\u001b[1;32m 464\u001b[0m \u001b[38;5;66;03m# Python 3 (including for exceptions like SystemExit).\u001b[39;00m\n\u001b[1;32m 465\u001b[0m \u001b[38;5;66;03m# Otherwise it looks like a bug in the code.\u001b[39;00m\n\u001b[0;32m--> 466\u001b[0m \u001b[43msix\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mraise_from\u001b[49m\u001b[43m(\u001b[49m\u001b[43me\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m)\u001b[49m\n\u001b[1;32m 467\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m (SocketTimeout, BaseSSLError, SocketError) \u001b[38;5;28;01mas\u001b[39;00m e:\n",
"File \u001b[0;32m<string>:3\u001b[0m, in \u001b[0;36mraise_from\u001b[0;34m(value, from_value)\u001b[0m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/urllib3/connectionpool.py:461\u001b[0m, in \u001b[0;36mHTTPConnectionPool._make_request\u001b[0;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[1;32m 460\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 461\u001b[0m httplib_response \u001b[38;5;241m=\u001b[39m \u001b[43mconn\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgetresponse\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 462\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mBaseException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[1;32m 463\u001b[0m \u001b[38;5;66;03m# Remove the TypeError from the exception chain in\u001b[39;00m\n\u001b[1;32m 464\u001b[0m \u001b[38;5;66;03m# Python 3 (including for exceptions like SystemExit).\u001b[39;00m\n\u001b[1;32m 465\u001b[0m \u001b[38;5;66;03m# Otherwise it looks like a bug in the code.\u001b[39;00m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/http/client.py:1377\u001b[0m, in \u001b[0;36mHTTPConnection.getresponse\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1376\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m-> 1377\u001b[0m \u001b[43mresponse\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbegin\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1378\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/http/client.py:320\u001b[0m, in \u001b[0;36mHTTPResponse.begin\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 319\u001b[0m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m:\n\u001b[0;32m--> 320\u001b[0m version, status, reason \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_read_status\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 321\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m status \u001b[38;5;241m!=\u001b[39m CONTINUE:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/http/client.py:281\u001b[0m, in \u001b[0;36mHTTPResponse._read_status\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 280\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_read_status\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[0;32m--> 281\u001b[0m line \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mstr\u001b[39m(\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mreadline\u001b[49m\u001b[43m(\u001b[49m\u001b[43m_MAXLINE\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m+\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m)\u001b[49m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124miso-8859-1\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 282\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(line) \u001b[38;5;241m>\u001b[39m _MAXLINE:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/socket.py:704\u001b[0m, in \u001b[0;36mSocketIO.readinto\u001b[0;34m(self, b)\u001b[0m\n\u001b[1;32m 703\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 704\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_sock\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrecv_into\u001b[49m\u001b[43m(\u001b[49m\u001b[43mb\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 705\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m timeout:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/ssl.py:1242\u001b[0m, in \u001b[0;36mSSLSocket.recv_into\u001b[0;34m(self, buffer, nbytes, flags)\u001b[0m\n\u001b[1;32m 1239\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 1240\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnon-zero flags not allowed in calls to recv_into() on \u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m%\u001b[39m\n\u001b[1;32m 1241\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__class__\u001b[39m)\n\u001b[0;32m-> 1242\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnbytes\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mbuffer\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1243\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/ssl.py:1100\u001b[0m, in \u001b[0;36mSSLSocket.read\u001b[0;34m(self, len, buffer)\u001b[0m\n\u001b[1;32m 1099\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m buffer \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m-> 1100\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_sslobj\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mlen\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mbuffer\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1101\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n",
"\u001b[0;31mtimeout\u001b[0m: The read operation timed out",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mReadTimeoutError\u001b[0m Traceback (most recent call last)",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/requests/adapters.py:486\u001b[0m, in \u001b[0;36mHTTPAdapter.send\u001b[0;34m(self, request, stream, timeout, verify, cert, proxies)\u001b[0m\n\u001b[1;32m 485\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 486\u001b[0m resp \u001b[38;5;241m=\u001b[39m \u001b[43mconn\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43murlopen\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 487\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrequest\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 488\u001b[0m \u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 489\u001b[0m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrequest\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 490\u001b[0m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrequest\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 491\u001b[0m \u001b[43m \u001b[49m\u001b[43mredirect\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 492\u001b[0m \u001b[43m \u001b[49m\u001b[43massert_same_host\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 493\u001b[0m \u001b[43m \u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 494\u001b[0m \u001b[43m \u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 495\u001b[0m \u001b[43m \u001b[49m\u001b[43mretries\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmax_retries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 496\u001b[0m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtimeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 497\u001b[0m \u001b[43m \u001b[49m\u001b[43mchunked\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mchunked\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 498\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 500\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m (ProtocolError, \u001b[38;5;167;01mOSError\u001b[39;00m) \u001b[38;5;28;01mas\u001b[39;00m err:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/urllib3/connectionpool.py:798\u001b[0m, in \u001b[0;36mHTTPConnectionPool.urlopen\u001b[0;34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)\u001b[0m\n\u001b[1;32m 796\u001b[0m e \u001b[38;5;241m=\u001b[39m ProtocolError(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mConnection aborted.\u001b[39m\u001b[38;5;124m\"\u001b[39m, e)\n\u001b[0;32m--> 798\u001b[0m retries \u001b[38;5;241m=\u001b[39m \u001b[43mretries\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mincrement\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 799\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43merror\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43me\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m_pool\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m_stacktrace\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43msys\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mexc_info\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m]\u001b[49m\n\u001b[1;32m 800\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 801\u001b[0m retries\u001b[38;5;241m.\u001b[39msleep()\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/urllib3/util/retry.py:550\u001b[0m, in \u001b[0;36mRetry.increment\u001b[0;34m(self, method, url, response, error, _pool, _stacktrace)\u001b[0m\n\u001b[1;32m 549\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m read \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mFalse\u001b[39;00m \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_is_method_retryable(method):\n\u001b[0;32m--> 550\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[43msix\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mreraise\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mtype\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43merror\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43merror\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m_stacktrace\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 551\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m read \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/urllib3/packages/six.py:770\u001b[0m, in \u001b[0;36mreraise\u001b[0;34m(tp, value, tb)\u001b[0m\n\u001b[1;32m 769\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m value\u001b[38;5;241m.\u001b[39mwith_traceback(tb)\n\u001b[0;32m--> 770\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m value\n\u001b[1;32m 771\u001b[0m \u001b[38;5;28;01mfinally\u001b[39;00m:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/urllib3/connectionpool.py:714\u001b[0m, in \u001b[0;36mHTTPConnectionPool.urlopen\u001b[0;34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)\u001b[0m\n\u001b[1;32m 713\u001b[0m \u001b[38;5;66;03m# Make the request on the httplib connection object.\u001b[39;00m\n\u001b[0;32m--> 714\u001b[0m httplib_response \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_make_request\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 715\u001b[0m \u001b[43m \u001b[49m\u001b[43mconn\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 716\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 717\u001b[0m \u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 718\u001b[0m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtimeout_obj\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 719\u001b[0m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 720\u001b[0m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 721\u001b[0m \u001b[43m \u001b[49m\u001b[43mchunked\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mchunked\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 722\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 724\u001b[0m \u001b[38;5;66;03m# If we're going to release the connection in ``finally:``, then\u001b[39;00m\n\u001b[1;32m 725\u001b[0m \u001b[38;5;66;03m# the response doesn't need to know about the connection. Otherwise\u001b[39;00m\n\u001b[1;32m 726\u001b[0m \u001b[38;5;66;03m# it will also try to release it and we'll have a double-release\u001b[39;00m\n\u001b[1;32m 727\u001b[0m \u001b[38;5;66;03m# mess.\u001b[39;00m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/urllib3/connectionpool.py:468\u001b[0m, in \u001b[0;36mHTTPConnectionPool._make_request\u001b[0;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[1;32m 467\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m (SocketTimeout, BaseSSLError, SocketError) \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[0;32m--> 468\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_raise_timeout\u001b[49m\u001b[43m(\u001b[49m\u001b[43merr\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43me\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtimeout_value\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mread_timeout\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 469\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/urllib3/connectionpool.py:357\u001b[0m, in \u001b[0;36mHTTPConnectionPool._raise_timeout\u001b[0;34m(self, err, url, timeout_value)\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(err, SocketTimeout):\n\u001b[0;32m--> 357\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m ReadTimeoutError(\n\u001b[1;32m 358\u001b[0m \u001b[38;5;28mself\u001b[39m, url, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mRead timed out. (read timeout=\u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m)\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m%\u001b[39m timeout_value\n\u001b[1;32m 359\u001b[0m )\n\u001b[1;32m 361\u001b[0m \u001b[38;5;66;03m# See the above comment about EAGAIN in Python 3. In Python 2 we have\u001b[39;00m\n\u001b[1;32m 362\u001b[0m \u001b[38;5;66;03m# to specifically catch it and throw the timeout error\u001b[39;00m\n",
"\u001b[0;31mReadTimeoutError\u001b[0m: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10.0)",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mReadTimeout\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[6], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m key \u001b[38;5;129;01min\u001b[39;00m mydict:\n\u001b[0;32m----> 2\u001b[0m mydict[key] \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mTextDetector\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m[\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43manalyse_text\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\n\u001b[1;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43manalyse_image\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:47\u001b[0m, in \u001b[0;36mTextDetector.analyse_image\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 45\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mremove_linebreaks()\n\u001b[1;32m 46\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39manalyse_text:\n\u001b[0;32m---> 47\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext_summary\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 48\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtext_sentiment_transformers()\n\u001b[1;32m 49\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtext_ner()\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:105\u001b[0m, in \u001b[0;36mTextDetector.text_summary\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 103\u001b[0m model_revision \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124ma4f8f3e\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 104\u001b[0m max_number_of_characters \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m3000\u001b[39m\n\u001b[0;32m--> 105\u001b[0m pipe \u001b[38;5;241m=\u001b[39m \u001b[43mpipeline\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 106\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43msummarization\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 107\u001b[0m \u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmodel_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 108\u001b[0m \u001b[43m \u001b[49m\u001b[43mrevision\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmodel_revision\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 109\u001b[0m \u001b[43m \u001b[49m\u001b[43mmin_length\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m5\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 110\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_length\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m20\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 111\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 112\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 113\u001b[0m summary \u001b[38;5;241m=\u001b[39m pipe(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39msubdict[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtext_english\u001b[39m\u001b[38;5;124m\"\u001b[39m][\u001b[38;5;241m0\u001b[39m:max_number_of_characters])\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/transformers/pipelines/__init__.py:828\u001b[0m, in \u001b[0;36mpipeline\u001b[0;34m(task, model, config, tokenizer, feature_extractor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)\u001b[0m\n\u001b[1;32m 825\u001b[0m tokenizer_identifier \u001b[38;5;241m=\u001b[39m tokenizer\n\u001b[1;32m 826\u001b[0m tokenizer_kwargs \u001b[38;5;241m=\u001b[39m model_kwargs\n\u001b[0;32m--> 828\u001b[0m tokenizer \u001b[38;5;241m=\u001b[39m \u001b[43mAutoTokenizer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_pretrained\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 829\u001b[0m \u001b[43m \u001b[49m\u001b[43mtokenizer_identifier\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43muse_fast\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muse_fast\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m_from_pipeline\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtask\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mhub_kwargs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mtokenizer_kwargs\u001b[49m\n\u001b[1;32m 830\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 832\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m load_feature_extractor:\n\u001b[1;32m 833\u001b[0m \u001b[38;5;66;03m# Try to infer feature extractor from model or config name (if provided as str)\u001b[39;00m\n\u001b[1;32m 834\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m feature_extractor \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:676\u001b[0m, in \u001b[0;36mAutoTokenizer.from_pretrained\u001b[0;34m(cls, pretrained_model_name_or_path, *inputs, **kwargs)\u001b[0m\n\u001b[1;32m 674\u001b[0m tokenizer_class_py, tokenizer_class_fast \u001b[38;5;241m=\u001b[39m TOKENIZER_MAPPING[\u001b[38;5;28mtype\u001b[39m(config)]\n\u001b[1;32m 675\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m tokenizer_class_fast \u001b[38;5;129;01mand\u001b[39;00m (use_fast \u001b[38;5;129;01mor\u001b[39;00m tokenizer_class_py \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m):\n\u001b[0;32m--> 676\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mtokenizer_class_fast\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_pretrained\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpretrained_model_name_or_path\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43minputs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 677\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 678\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m tokenizer_class_py \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:1763\u001b[0m, in \u001b[0;36mPreTrainedTokenizerBase.from_pretrained\u001b[0;34m(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)\u001b[0m\n\u001b[1;32m 1761\u001b[0m resolved_vocab_files[file_id] \u001b[38;5;241m=\u001b[39m download_url(file_path, proxies\u001b[38;5;241m=\u001b[39mproxies)\n\u001b[1;32m 1762\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1763\u001b[0m resolved_vocab_files[file_id] \u001b[38;5;241m=\u001b[39m \u001b[43mcached_file\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1764\u001b[0m \u001b[43m \u001b[49m\u001b[43mpretrained_model_name_or_path\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1765\u001b[0m \u001b[43m \u001b[49m\u001b[43mfile_path\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1766\u001b[0m \u001b[43m \u001b[49m\u001b[43mcache_dir\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcache_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1767\u001b[0m \u001b[43m \u001b[49m\u001b[43mforce_download\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mforce_download\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1768\u001b[0m \u001b[43m \u001b[49m\u001b[43mproxies\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mproxies\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1769\u001b[0m \u001b[43m \u001b[49m\u001b[43mresume_download\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mresume_download\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1770\u001b[0m \u001b[43m \u001b[49m\u001b[43mlocal_files_only\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlocal_files_only\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1771\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_auth_token\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muse_auth_token\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1772\u001b[0m \u001b[43m \u001b[49m\u001b[43muser_agent\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muser_agent\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1773\u001b[0m \u001b[43m \u001b[49m\u001b[43mrevision\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrevision\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1774\u001b[0m \u001b[43m \u001b[49m\u001b[43msubfolder\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43msubfolder\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1775\u001b[0m \u001b[43m \u001b[49m\u001b[43m_raise_exceptions_for_missing_entries\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 1776\u001b[0m \u001b[43m \u001b[49m\u001b[43m_raise_exceptions_for_connection_errors\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 1777\u001b[0m \u001b[43m \u001b[49m\u001b[43m_commit_hash\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcommit_hash\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1778\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1779\u001b[0m commit_hash \u001b[38;5;241m=\u001b[39m extract_commit_hash(resolved_vocab_files[file_id], commit_hash)\n\u001b[1;32m 1781\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(unresolved_files) \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m0\u001b[39m:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/transformers/utils/hub.py:409\u001b[0m, in \u001b[0;36mcached_file\u001b[0;34m(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only, subfolder, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash)\u001b[0m\n\u001b[1;32m 406\u001b[0m user_agent \u001b[38;5;241m=\u001b[39m http_user_agent(user_agent)\n\u001b[1;32m 407\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 408\u001b[0m \u001b[38;5;66;03m# Load from URL or cache if already cached\u001b[39;00m\n\u001b[0;32m--> 409\u001b[0m resolved_file \u001b[38;5;241m=\u001b[39m \u001b[43mhf_hub_download\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 410\u001b[0m \u001b[43m \u001b[49m\u001b[43mpath_or_repo_id\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 411\u001b[0m \u001b[43m \u001b[49m\u001b[43mfilename\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 412\u001b[0m \u001b[43m \u001b[49m\u001b[43msubfolder\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mif\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43mlen\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43msubfolder\u001b[49m\u001b[43m)\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m==\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01melse\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43msubfolder\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 413\u001b[0m \u001b[43m \u001b[49m\u001b[43mrevision\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrevision\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 414\u001b[0m \u001b[43m \u001b[49m\u001b[43mcache_dir\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcache_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 415\u001b[0m \u001b[43m \u001b[49m\u001b[43muser_agent\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muser_agent\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 416\u001b[0m \u001b[43m \u001b[49m\u001b[43mforce_download\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mforce_download\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 417\u001b[0m \u001b[43m \u001b[49m\u001b[43mproxies\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mproxies\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 418\u001b[0m \u001b[43m \u001b[49m\u001b[43mresume_download\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mresume_download\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 419\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_auth_token\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muse_auth_token\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 420\u001b[0m \u001b[43m \u001b[49m\u001b[43mlocal_files_only\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlocal_files_only\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 421\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 423\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m RepositoryNotFoundError:\n\u001b[1;32m 424\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mEnvironmentError\u001b[39;00m(\n\u001b[1;32m 425\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mpath_or_repo_id\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m is not a local folder and is not a valid model identifier \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 426\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mlisted on \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mhttps://huggingface.co/models\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124mIf this is a private repository, make sure to \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 427\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mpass a token having permission to this repo with `use_auth_token` or log in with \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 428\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m`huggingface-cli login` and pass `use_auth_token=True`.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 429\u001b[0m )\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py:118\u001b[0m, in \u001b[0;36mvalidate_hf_hub_args.<locals>._inner_fn\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 115\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m check_use_auth_token:\n\u001b[1;32m 116\u001b[0m kwargs \u001b[38;5;241m=\u001b[39m smoothly_deprecate_use_auth_token(fn_name\u001b[38;5;241m=\u001b[39mfn\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__name__\u001b[39m, has_token\u001b[38;5;241m=\u001b[39mhas_token, kwargs\u001b[38;5;241m=\u001b[39mkwargs)\n\u001b[0;32m--> 118\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfn\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/huggingface_hub/file_download.py:1364\u001b[0m, in \u001b[0;36mhf_hub_download\u001b[0;34m(repo_id, filename, subfolder, repo_type, revision, library_name, library_version, cache_dir, local_dir, local_dir_use_symlinks, user_agent, force_download, force_filename, proxies, etag_timeout, resume_download, token, local_files_only, legacy_cache_layout)\u001b[0m\n\u001b[1;32m 1361\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m temp_file_manager() \u001b[38;5;28;01mas\u001b[39;00m temp_file:\n\u001b[1;32m 1362\u001b[0m logger\u001b[38;5;241m.\u001b[39minfo(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdownloading \u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m to \u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m\"\u001b[39m, url, temp_file\u001b[38;5;241m.\u001b[39mname)\n\u001b[0;32m-> 1364\u001b[0m \u001b[43mhttp_get\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1365\u001b[0m \u001b[43m \u001b[49m\u001b[43murl_to_download\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1366\u001b[0m \u001b[43m \u001b[49m\u001b[43mtemp_file\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1367\u001b[0m \u001b[43m \u001b[49m\u001b[43mproxies\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mproxies\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1368\u001b[0m \u001b[43m \u001b[49m\u001b[43mresume_size\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mresume_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1369\u001b[0m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1370\u001b[0m \u001b[43m \u001b[49m\u001b[43mexpected_size\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mexpected_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1371\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1373\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m local_dir \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 1374\u001b[0m logger\u001b[38;5;241m.\u001b[39minfo(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mStoring \u001b[39m\u001b[38;5;132;01m{\u001b[39;00murl\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m in cache at \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mblob_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/huggingface_hub/file_download.py:505\u001b[0m, in \u001b[0;36mhttp_get\u001b[0;34m(url, temp_file, proxies, resume_size, headers, timeout, max_retries, expected_size)\u001b[0m\n\u001b[1;32m 502\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m resume_size \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m0\u001b[39m:\n\u001b[1;32m 503\u001b[0m headers[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mRange\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mbytes=\u001b[39m\u001b[38;5;132;01m%d\u001b[39;00m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m%\u001b[39m (resume_size,)\n\u001b[0;32m--> 505\u001b[0m r \u001b[38;5;241m=\u001b[39m \u001b[43m_request_wrapper\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 506\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mGET\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 507\u001b[0m \u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 508\u001b[0m \u001b[43m \u001b[49m\u001b[43mstream\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 509\u001b[0m \u001b[43m \u001b[49m\u001b[43mproxies\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mproxies\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 510\u001b[0m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 511\u001b[0m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtimeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 512\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_retries\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmax_retries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 513\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 514\u001b[0m hf_raise_for_status(r)\n\u001b[1;32m 515\u001b[0m content_length \u001b[38;5;241m=\u001b[39m r\u001b[38;5;241m.\u001b[39mheaders\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mContent-Length\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/huggingface_hub/file_download.py:442\u001b[0m, in \u001b[0;36m_request_wrapper\u001b[0;34m(method, url, max_retries, base_wait_time, max_wait_time, timeout, follow_relative_redirects, **params)\u001b[0m\n\u001b[1;32m 439\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m response\n\u001b[1;32m 441\u001b[0m \u001b[38;5;66;03m# 3. Exponential backoff\u001b[39;00m\n\u001b[0;32m--> 442\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mhttp_backoff\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 443\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 444\u001b[0m \u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 445\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_retries\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmax_retries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 446\u001b[0m \u001b[43m \u001b[49m\u001b[43mbase_wait_time\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbase_wait_time\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 447\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_wait_time\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmax_wait_time\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 448\u001b[0m \u001b[43m \u001b[49m\u001b[43mretry_on_exceptions\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mConnectTimeout\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mProxyError\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 449\u001b[0m \u001b[43m \u001b[49m\u001b[43mretry_on_status_codes\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 450\u001b[0m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtimeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 451\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mparams\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 452\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/huggingface_hub/utils/_http.py:212\u001b[0m, in \u001b[0;36mhttp_backoff\u001b[0;34m(method, url, max_retries, base_wait_time, max_wait_time, retry_on_exceptions, retry_on_status_codes, **kwargs)\u001b[0m\n\u001b[1;32m 209\u001b[0m kwargs[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdata\u001b[39m\u001b[38;5;124m\"\u001b[39m]\u001b[38;5;241m.\u001b[39mseek(io_obj_initial_pos)\n\u001b[1;32m 211\u001b[0m \u001b[38;5;66;03m# Perform request and return if status_code is not in the retry list.\u001b[39;00m\n\u001b[0;32m--> 212\u001b[0m response \u001b[38;5;241m=\u001b[39m \u001b[43msession\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrequest\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmethod\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 213\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m response\u001b[38;5;241m.\u001b[39mstatus_code \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m retry_on_status_codes:\n\u001b[1;32m 214\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m response\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/requests/sessions.py:589\u001b[0m, in \u001b[0;36mSession.request\u001b[0;34m(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)\u001b[0m\n\u001b[1;32m 584\u001b[0m send_kwargs \u001b[38;5;241m=\u001b[39m {\n\u001b[1;32m 585\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtimeout\u001b[39m\u001b[38;5;124m\"\u001b[39m: timeout,\n\u001b[1;32m 586\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mallow_redirects\u001b[39m\u001b[38;5;124m\"\u001b[39m: allow_redirects,\n\u001b[1;32m 587\u001b[0m }\n\u001b[1;32m 588\u001b[0m send_kwargs\u001b[38;5;241m.\u001b[39mupdate(settings)\n\u001b[0;32m--> 589\u001b[0m resp \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msend\u001b[49m\u001b[43m(\u001b[49m\u001b[43mprep\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43msend_kwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 591\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m resp\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/requests/sessions.py:703\u001b[0m, in \u001b[0;36mSession.send\u001b[0;34m(self, request, **kwargs)\u001b[0m\n\u001b[1;32m 700\u001b[0m start \u001b[38;5;241m=\u001b[39m preferred_clock()\n\u001b[1;32m 702\u001b[0m \u001b[38;5;66;03m# Send the request\u001b[39;00m\n\u001b[0;32m--> 703\u001b[0m r \u001b[38;5;241m=\u001b[39m \u001b[43madapter\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msend\u001b[49m\u001b[43m(\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 705\u001b[0m \u001b[38;5;66;03m# Total elapsed time of the request (approximately)\u001b[39;00m\n\u001b[1;32m 706\u001b[0m elapsed \u001b[38;5;241m=\u001b[39m preferred_clock() \u001b[38;5;241m-\u001b[39m start\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/requests/adapters.py:532\u001b[0m, in \u001b[0;36mHTTPAdapter.send\u001b[0;34m(self, request, stream, timeout, verify, cert, proxies)\u001b[0m\n\u001b[1;32m 530\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m SSLError(e, request\u001b[38;5;241m=\u001b[39mrequest)\n\u001b[1;32m 531\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(e, ReadTimeoutError):\n\u001b[0;32m--> 532\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m ReadTimeout(e, request\u001b[38;5;241m=\u001b[39mrequest)\n\u001b[1;32m 533\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(e, _InvalidHeader):\n\u001b[1;32m 534\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m InvalidHeader(e, request\u001b[38;5;241m=\u001b[39mrequest)\n",
"\u001b[0;31mReadTimeout\u001b[0m: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10.0)"
]
}
],
"source": [
"for key in mydict:\n",
" mydict[key] = ammico.text.TextDetector(\n",
" mydict[key], analyse_text=True\n",
" ).analyse_image()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3c063eda",
"metadata": {},
"source": [
"## Convert to dataframe and write csv\n",
"These steps are required to convert the dictionary of dictionarys into a dictionary with lists, that can be converted into a pandas dataframe and exported to a csv file."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5709c2cd",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:56.436343Z",
"iopub.status.busy": "2023-06-27T07:38:56.435835Z",
"iopub.status.idle": "2023-06-27T07:38:57.091423Z",
"shell.execute_reply": "2023-06-27T07:38:57.090580Z"
}
},
"outputs": [
{
"ename": "ValueError",
"evalue": "All arrays must be of the same length",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[7], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m outdict \u001b[38;5;241m=\u001b[39m mutils\u001b[38;5;241m.\u001b[39mappend_data_to_dict(mydict)\n\u001b[0;32m----> 2\u001b[0m df \u001b[38;5;241m=\u001b[39m \u001b[43mmutils\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdump_df\u001b[49m\u001b[43m(\u001b[49m\u001b[43moutdict\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/utils.py:106\u001b[0m, in \u001b[0;36mdump_df\u001b[0;34m(mydict)\u001b[0m\n\u001b[1;32m 104\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mdump_df\u001b[39m(mydict: \u001b[38;5;28mdict\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m DataFrame:\n\u001b[1;32m 105\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Utility to dump the dictionary into a dataframe.\"\"\"\u001b[39;00m\n\u001b[0;32m--> 106\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mDataFrame\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/frame.py:1760\u001b[0m, in \u001b[0;36mDataFrame.from_dict\u001b[0;34m(cls, data, orient, dtype, columns)\u001b[0m\n\u001b[1;32m 1754\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 1755\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mExpected \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m, \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m or \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mtight\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m for orient parameter. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1756\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mGot \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00morient\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m instead\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1757\u001b[0m )\n\u001b[1;32m 1759\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m orient \u001b[38;5;241m!=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtight\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[0;32m-> 1760\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mcls\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1761\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 1762\u001b[0m realdata \u001b[38;5;241m=\u001b[39m data[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdata\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/frame.py:709\u001b[0m, in \u001b[0;36mDataFrame.__init__\u001b[0;34m(self, data, index, columns, dtype, copy)\u001b[0m\n\u001b[1;32m 703\u001b[0m mgr \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_init_mgr(\n\u001b[1;32m 704\u001b[0m data, axes\u001b[38;5;241m=\u001b[39m{\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m\"\u001b[39m: index, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m\"\u001b[39m: columns}, dtype\u001b[38;5;241m=\u001b[39mdtype, copy\u001b[38;5;241m=\u001b[39mcopy\n\u001b[1;32m 705\u001b[0m )\n\u001b[1;32m 707\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, \u001b[38;5;28mdict\u001b[39m):\n\u001b[1;32m 708\u001b[0m \u001b[38;5;66;03m# GH#38939 de facto copy defaults to False only in non-dict cases\u001b[39;00m\n\u001b[0;32m--> 709\u001b[0m mgr \u001b[38;5;241m=\u001b[39m \u001b[43mdict_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmanager\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 710\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, ma\u001b[38;5;241m.\u001b[39mMaskedArray):\n\u001b[1;32m 711\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mnumpy\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mma\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m mrecords\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:481\u001b[0m, in \u001b[0;36mdict_to_mgr\u001b[0;34m(data, index, columns, dtype, typ, copy)\u001b[0m\n\u001b[1;32m 477\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 478\u001b[0m \u001b[38;5;66;03m# dtype check to exclude e.g. range objects, scalars\u001b[39;00m\n\u001b[1;32m 479\u001b[0m arrays \u001b[38;5;241m=\u001b[39m [x\u001b[38;5;241m.\u001b[39mcopy() \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(x, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdtype\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m x \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m arrays]\n\u001b[0;32m--> 481\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43marrays_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtyp\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mconsolidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:115\u001b[0m, in \u001b[0;36marrays_to_mgr\u001b[0;34m(arrays, columns, index, dtype, verify_integrity, typ, consolidate)\u001b[0m\n\u001b[1;32m 112\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m verify_integrity:\n\u001b[1;32m 113\u001b[0m \u001b[38;5;66;03m# figure out the index, if necessary\u001b[39;00m\n\u001b[1;32m 114\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m index \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m--> 115\u001b[0m index \u001b[38;5;241m=\u001b[39m \u001b[43m_extract_index\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 116\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 117\u001b[0m index \u001b[38;5;241m=\u001b[39m ensure_index(index)\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:655\u001b[0m, in \u001b[0;36m_extract_index\u001b[0;34m(data)\u001b[0m\n\u001b[1;32m 653\u001b[0m lengths \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mset\u001b[39m(raw_lengths))\n\u001b[1;32m 654\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(lengths) \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[0;32m--> 655\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAll arrays must be of the same length\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 657\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m have_dicts:\n\u001b[1;32m 658\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 659\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMixing dicts with non-Series may lead to ambiguous ordering.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 660\u001b[0m )\n",
"\u001b[0;31mValueError\u001b[0m: All arrays must be of the same length"
]
}
],
"source": [
"outdict = mutils.append_data_to_dict(mydict)\n",
"df = mutils.dump_df(outdict)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ae182eb7",
"metadata": {},
"source": [
"Check the dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c4f05637",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:57.095189Z",
"iopub.status.busy": "2023-06-27T07:38:57.094704Z",
"iopub.status.idle": "2023-06-27T07:38:57.129394Z",
"shell.execute_reply": "2023-06-27T07:38:57.128617Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'df' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[8], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mdf\u001b[49m\u001b[38;5;241m.\u001b[39mhead(\u001b[38;5;241m10\u001b[39m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'df' is not defined"
]
}
],
"source": [
"df.head(10)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "eedf1e47",
"metadata": {},
"source": [
"Write the csv file - here you should provide a file path and file name for the csv file to be written."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "bf6c9ddb",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:57.132852Z",
"iopub.status.busy": "2023-06-27T07:38:57.132387Z",
"iopub.status.idle": "2023-06-27T07:38:57.163274Z",
"shell.execute_reply": "2023-06-27T07:38:57.162553Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'df' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[9], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Write the csv\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mdf\u001b[49m\u001b[38;5;241m.\u001b[39mto_csv(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m./data_out.csv\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'df' is not defined"
]
}
],
"source": [
"# Write the csv\n",
"df.to_csv(\"./data_out.csv\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4bc8ac0a",
"metadata": {},
"source": [
"## Topic analysis\n",
"The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) using an embedded model through a [spaCy](https://spacy.io/) pipeline."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4931941b",
"metadata": {},
"source": [
"BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If the below returns an error for `analyse_topic()`, the reason can be that your dataset is too small.\n",
"\n",
"You can pass which dataframe entry you would like to have analyzed. The default is `text_english`, but you could for example also select `text_summary` or `text_english_correct` setting the keyword `analyze_text` as so:\n",
"\n",
"`ammico.text.PostprocessText(mydict=mydict, analyze_text=\"text_summary\").analyse_topic()`\n",
"\n",
"### Option 1: Use the dictionary as obtained from the above analysis."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a3450a61",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:57.166372Z",
"iopub.status.busy": "2023-06-27T07:38:57.165937Z",
"iopub.status.idle": "2023-06-27T07:38:57.217639Z",
"shell.execute_reply": "2023-06-27T07:38:57.216808Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reading data from dict.\n"
]
},
{
"ename": "ValueError",
"evalue": "Please check your provided dictionary - no text_english text data found.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[10], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# make a list of all the text_english entries per analysed image from the mydict variable as above\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m topic_model, topic_df, most_frequent_topics \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPostprocessText\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mmydict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmydict\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39manalyse_topic()\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:179\u001b[0m, in \u001b[0;36mPostprocessText.__init__\u001b[0;34m(self, mydict, use_csv, csv_path, analyze_text)\u001b[0m\n\u001b[1;32m 177\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from dict.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 178\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict \u001b[38;5;241m=\u001b[39m mydict\n\u001b[0;32m--> 179\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlist_text_english \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43manalyze_text\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 180\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39muse_csv:\n\u001b[1;32m 181\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from df.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:251\u001b[0m, in \u001b[0;36mPostprocessText.get_text_dict\u001b[0;34m(self, analyze_text)\u001b[0m\n\u001b[1;32m 249\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m key \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict\u001b[38;5;241m.\u001b[39mkeys():\n\u001b[1;32m 250\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m analyze_text \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict[key]:\n\u001b[0;32m--> 251\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 252\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease check your provided dictionary - \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 253\u001b[0m \u001b[38;5;124m no \u001b[39m\u001b[38;5;132;01m{}\u001b[39;00m\u001b[38;5;124m text data found.\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mformat(\n\u001b[1;32m 254\u001b[0m analyze_text\n\u001b[1;32m 255\u001b[0m )\n\u001b[1;32m 256\u001b[0m )\n\u001b[1;32m 257\u001b[0m list_text_english\u001b[38;5;241m.\u001b[39mappend(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict[key][analyze_text])\n\u001b[1;32m 258\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m list_text_english\n",
"\u001b[0;31mValueError\u001b[0m: Please check your provided dictionary - no text_english text data found."
]
}
],
"source": [
"# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
" mydict=mydict\n",
").analyse_topic()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "95667342",
"metadata": {},
"source": [
"### Option 2: Read in a csv\n",
"Not to analyse too many images on google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "5530e436",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:57.221783Z",
"iopub.status.busy": "2023-06-27T07:38:57.221273Z",
"iopub.status.idle": "2023-06-27T07:38:57.282800Z",
"shell.execute_reply": "2023-06-27T07:38:57.281990Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reading data from df.\n"
]
},
{
"ename": "ValueError",
"evalue": "Please check your provided dataframe - no text_english text data found.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[11], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m input_file_path \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdata_out.csv\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m----> 2\u001b[0m topic_model, topic_df, most_frequent_topics \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPostprocessText\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_csv\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcsv_path\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minput_file_path\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39manalyse_topic(return_topics\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m10\u001b[39m)\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:183\u001b[0m, in \u001b[0;36mPostprocessText.__init__\u001b[0;34m(self, mydict, use_csv, csv_path, analyze_text)\u001b[0m\n\u001b[1;32m 181\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from df.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 182\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(csv_path, encoding\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mutf8\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m--> 183\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlist_text_english \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_df\u001b[49m\u001b[43m(\u001b[49m\u001b[43manalyze_text\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 184\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 185\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 186\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease provide either dictionary with textual data or \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 187\u001b[0m \u001b[38;5;124m a csv file by setting `use_csv` to True and providing a \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 188\u001b[0m \u001b[38;5;124m `csv_path`.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 189\u001b[0m )\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:273\u001b[0m, in \u001b[0;36mPostprocessText.get_text_df\u001b[0;34m(self, analyze_text)\u001b[0m\n\u001b[1;32m 270\u001b[0m \u001b[38;5;66;03m# use csv file to obtain dataframe and put text_english or text_summary in list\u001b[39;00m\n\u001b[1;32m 271\u001b[0m \u001b[38;5;66;03m# check that \"text_english\" or \"text_summary\" is there\u001b[39;00m\n\u001b[1;32m 272\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m analyze_text \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf:\n\u001b[0;32m--> 273\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 274\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease check your provided dataframe - \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 275\u001b[0m \u001b[38;5;124m no \u001b[39m\u001b[38;5;132;01m{}\u001b[39;00m\u001b[38;5;124m text data found.\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mformat(\n\u001b[1;32m 276\u001b[0m analyze_text\n\u001b[1;32m 277\u001b[0m )\n\u001b[1;32m 278\u001b[0m )\n\u001b[1;32m 279\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf[analyze_text]\u001b[38;5;241m.\u001b[39mtolist()\n",
"\u001b[0;31mValueError\u001b[0m: Please check your provided dataframe - no text_english text data found."
]
}
],
"source": [
"input_file_path = \"data_out.csv\"\n",
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
" use_csv=True, csv_path=input_file_path\n",
").analyse_topic(return_topics=10)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0b6ef6d7",
"metadata": {},
"source": [
"### Access frequent topics\n",
"A topic of `-1` stands for an outlier and should be ignored. Topic count is the number of occurence of that topic. The output is structured from most frequent to least frequent topic."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:57.286871Z",
"iopub.status.busy": "2023-06-27T07:38:57.286403Z",
"iopub.status.idle": "2023-06-27T07:38:57.320459Z",
"shell.execute_reply": "2023-06-27T07:38:57.319686Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'topic_df' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[12], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[43mtopic_df\u001b[49m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'topic_df' is not defined"
]
}
],
"source": [
"print(topic_df)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b3316770",
"metadata": {},
"source": [
"### Get information for specific topic\n",
"The most frequent topics can be accessed through `most_frequent_topics` with the most occuring topics first in the list."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "db14fe03",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:57.323822Z",
"iopub.status.busy": "2023-06-27T07:38:57.323431Z",
"iopub.status.idle": "2023-06-27T07:38:57.354044Z",
"shell.execute_reply": "2023-06-27T07:38:57.353248Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'most_frequent_topics' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[13], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m topic \u001b[38;5;129;01min\u001b[39;00m \u001b[43mmost_frequent_topics\u001b[49m:\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTopic:\u001b[39m\u001b[38;5;124m\"\u001b[39m, topic)\n",
"\u001b[0;31mNameError\u001b[0m: name 'most_frequent_topics' is not defined"
]
}
],
"source": [
"for topic in most_frequent_topics:\n",
" print(\"Topic:\", topic)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d10f701e",
"metadata": {},
"source": [
"### Topic visualization\n",
"The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "2331afe6",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:57.357953Z",
"iopub.status.busy": "2023-06-27T07:38:57.357475Z",
"iopub.status.idle": "2023-06-27T07:38:57.389737Z",
"shell.execute_reply": "2023-06-27T07:38:57.389032Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'topic_model' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[14], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtopic_model\u001b[49m\u001b[38;5;241m.\u001b[39mvisualize_topics()\n",
"\u001b[0;31mNameError\u001b[0m: name 'topic_model' is not defined"
]
}
],
"source": [
"topic_model.visualize_topics()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f4eaf353",
"metadata": {},
"source": [
"### Save the model\n",
"The model can be saved for future use."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e5e8377c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-27T07:38:57.393255Z",
"iopub.status.busy": "2023-06-27T07:38:57.392843Z",
"iopub.status.idle": "2023-06-27T07:38:57.427031Z",
"shell.execute_reply": "2023-06-27T07:38:57.426285Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'topic_model' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[15], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtopic_model\u001b[49m\u001b[38;5;241m.\u001b[39msave(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmisinfo_posts\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'topic_model' is not defined"
]
}
],
"source": [
"topic_model.save(\"misinfo_posts\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c94edb9",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
},
"vscode": {
"interpreter": {
"hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}