mirror of
https://github.com/ssciwr/AMMICO.git
synced 2025-10-30 21:46:04 +02:00
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "dcaa3da1",
"metadata": {},
"source": [
"# Notebook for text extraction from images\n",
"\n",
"The text extraction and analysis are carried out using a variety of tools: \n",
"\n",
"1. Text extraction from the image using [google-cloud-vision](https://cloud.google.com/vision) \n",
"1. Language detection of the extracted text using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
"1. Translation into English or other languages using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
"1. Cleaning of the text using [spaCy](https://spacy.io/) \n",
"1. Spell-check using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
"1. Subjectivity analysis using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
"1. Text summarization using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n",
"1. Sentiment analysis using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
"1. Named entity recognition using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
"1. Topic analysis using [BERTopic](https://github.com/MaartenGr/BERTopic) \n",
"\n",
"The first cell only runs on Google Colab and installs the [ammico](https://github.com/ssciwr/AMMICO) package.\n",
"\n",
"After that, we can import `ammico` and read in the files given a folder path."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f43f327c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:20:46.163647Z",
"iopub.status.busy": "2023-10-20T12:20:46.163348Z",
"iopub.status.idle": "2023-10-20T12:20:46.172494Z",
"shell.execute_reply": "2023-10-20T12:20:46.171732Z"
}
},
"outputs": [],
"source": [
"# if running on google colab\n",
"# flake8-noqa-cell\n",
"import os\n",
"\n",
"if \"google.colab\" in str(get_ipython()):\n",
"    # update python version\n",
"    # install setuptools\n",
"    # %pip install setuptools==61 -qqq\n",
"    # install ammico\n",
"    %pip install git+https://github.com/ssciwr/ammico.git -qqq\n",
"    # mount google drive for data and API key\n",
"    from google.colab import drive\n",
"\n",
"    drive.mount(\"/content/drive\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cf362e60",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:20:46.177023Z",
"iopub.status.busy": "2023-10-20T12:20:46.176316Z",
"iopub.status.idle": "2023-10-20T12:21:03.519993Z",
"shell.execute_reply": "2023-10-20T12:21:03.519212Z"
}
},
"outputs": [],
"source": [
"import os\n",
"import ammico\n",
"from ammico import utils as mutils\n",
"from ammico import display as mdisplay"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "fddba721",
"metadata": {},
"source": [
"The `find_files` function finds image files within a given directory. We select only a subset of them to try the text extraction on, using the `limit` keyword:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "27675810",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:03.526923Z",
"iopub.status.busy": "2023-10-20T12:21:03.525805Z",
"iopub.status.idle": "2023-10-20T12:21:03.531640Z",
"shell.execute_reply": "2023-10-20T12:21:03.531019Z"
}
},
"outputs": [],
"source": [
"# Here you need to provide the path to your google drive folder\n",
"# or local folder containing the images\n",
"images = mutils.find_files(\n",
"    path=\"data/\",\n",
"    limit=10,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3a7dfe11",
"metadata": {},
"source": [
"We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8b32409f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:03.535641Z",
"iopub.status.busy": "2023-10-20T12:21:03.535192Z",
"iopub.status.idle": "2023-10-20T12:21:03.539393Z",
"shell.execute_reply": "2023-10-20T12:21:03.538742Z"
}
},
"outputs": [],
"source": [
"mydict = mutils.initialize_dict(images)"
]
},
{
"cell_type": "markdown",
"id": "7b8b929f",
"metadata": {},
"source": [
"## Google Cloud Vision API\n",
"\n",
"For this you need an API key and must have the API activated in your Google Cloud console. The first 1000 images per month are free (as of July 2022)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986",
"metadata": {},
"source": [
"```\n",
"os.environ[\n",
"    \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
"] = \"your-credentials.json\"\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0891b795-c7fe-454c-a45d-45fadf788142",
"metadata": {},
"source": [
"## Inspect the elements per image\n",
"To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a CSV file in the step below.\n",
"Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the `port` number if you are already running several notebook instances on the same server."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7c6ecc88",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:03.543363Z",
"iopub.status.busy": "2023-10-20T12:21:03.542918Z",
"iopub.status.idle": "2023-10-20T12:21:04.819592Z",
"shell.execute_reply": "2023-10-20T12:21:04.818855Z"
}
},
"outputs": [],
"source": [
"# AnalysisExplorer does not accept an `identify` keyword (it raised a TypeError)\n",
"analysis_explorer = mdisplay.AnalysisExplorer(mydict)\n",
"analysis_explorer.run_server(port=8054)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52",
"metadata": {},
"source": [
"## Or directly analyze for further processing\n",
"Instead of inspecting each of the images, you can also directly carry out the analysis and export the result to a CSV file. This may take a while depending on how many images you have loaded. Set the keyword `analyse_text` to `True` if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:04.824811Z",
"iopub.status.busy": "2023-10-20T12:21:04.824289Z",
"iopub.status.idle": "2023-10-20T12:21:12.540436Z",
"shell.execute_reply": "2023-10-20T12:21:12.539186Z"
}
},
"outputs": [
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Collecting en-core-web-md==3.7.0\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.0/en_core_web_md-3.7.0-py3-none-any.whl (42.8 MB)\n",
|
|
"\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/42.8 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r",
|
|
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.1/42.8 MB\u001b[0m \u001b[31m4.2 MB/s\u001b[0m eta \u001b[36m0:00:11\u001b[0m\r",
|
|
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.5/42.8 MB\u001b[0m \u001b[31m7.7 MB/s\u001b[0m eta \u001b[36m0:00:06\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.0/42.8 MB\u001b[0m \u001b[31m9.4 MB/s\u001b[0m eta \u001b[36m0:00:05\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.5/42.8 MB\u001b[0m \u001b[31m11.0 MB/s\u001b[0m eta \u001b[36m0:00:04\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.2/42.8 MB\u001b[0m \u001b[31m12.4 MB/s\u001b[0m eta \u001b[36m0:00:04\u001b[0m"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"\u001b[2K \u001b[91m━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.9/42.8 MB\u001b[0m \u001b[31m13.7 MB/s\u001b[0m eta \u001b[36m0:00:03\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.9/42.8 MB\u001b[0m \u001b[31m15.6 MB/s\u001b[0m eta \u001b[36m0:00:03\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.0/42.8 MB\u001b[0m \u001b[31m17.8 MB/s\u001b[0m eta \u001b[36m0:00:03\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.4/42.8 MB\u001b[0m \u001b[31m20.1 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m8.0/42.8 MB\u001b[0m \u001b[31m22.7 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m10.1/42.8 MB\u001b[0m \u001b[31m25.9 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.1/42.8 MB\u001b[0m \u001b[31m39.5 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m15.0/42.8 MB\u001b[0m \u001b[31m54.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m18.7/42.8 MB\u001b[0m \u001b[31m73.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m23.1/42.8 MB\u001b[0m \u001b[31m105.2 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━\u001b[0m \u001b[32m25.7/42.8 MB\u001b[0m \u001b[31m99.0 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━\u001b[0m \u001b[32m30.2/42.8 MB\u001b[0m \u001b[31m107.3 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━\u001b[0m \u001b[32m35.0/42.8 MB\u001b[0m \u001b[31m107.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━\u001b[0m \u001b[32m40.3/42.8 MB\u001b[0m \u001b[31m138.5 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m146.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m146.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
|
|
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m146.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
|
|
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m42.8/42.8 MB\u001b[0m \u001b[31m57.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
|
|
"\u001b[?25h"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Requirement already satisfied: spacy<3.8.0,>=3.7.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from en-core-web-md==3.7.0) (3.7.2)\n",
|
|
"Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (3.0.12)\n",
|
|
"Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (1.0.5)\n",
|
|
"Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (1.0.10)\n",
|
|
"Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.0.8)\n",
|
|
"Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (3.0.9)\n",
|
|
"Requirement already satisfied: thinc<8.3.0,>=8.1.8 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (8.2.1)\n",
|
|
"Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (1.1.2)\n",
|
|
"Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.4.8)\n",
|
|
"Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.0.10)\n",
|
|
"Requirement already satisfied: weasel<0.4.0,>=0.1.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (0.3.3)\n",
|
|
"Requirement already satisfied: typer<0.10.0,>=0.3.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (0.9.0)\n",
|
|
"Requirement already satisfied: smart-open<7.0.0,>=5.2.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (6.4.0)\n",
|
|
"Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (4.66.1)\n",
|
|
"Requirement already satisfied: requests<3.0.0,>=2.13.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.31.0)\n",
|
|
"Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (1.10.13)\n",
|
|
"Requirement already satisfied: jinja2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (3.1.2)\n",
|
|
"Requirement already satisfied: setuptools in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (58.1.0)\n",
|
|
"Requirement already satisfied: packaging>=20.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (23.2)\n",
|
|
"Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (3.3.0)\n",
|
|
"Requirement already satisfied: numpy>=1.19.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (1.23.4)\n",
|
|
"Requirement already satisfied: typing-extensions>=4.2.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (4.5.0)\n",
|
|
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (3.3.0)\n",
|
|
"Requirement already satisfied: idna<4,>=2.5 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.10)\n",
|
|
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.0.7)\n",
|
|
"Requirement already satisfied: certifi>=2017.4.17 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2023.7.22)\n",
|
|
"Requirement already satisfied: blis<0.8.0,>=0.7.8 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from thinc<8.3.0,>=8.1.8->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (0.7.11)\n",
|
|
"Requirement already satisfied: confection<1.0.0,>=0.0.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from thinc<8.3.0,>=8.1.8->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (0.1.3)\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Requirement already satisfied: click<9.0.0,>=7.1.1 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from typer<0.10.0,>=0.3.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (8.1.7)\n",
|
|
"Requirement already satisfied: cloudpathlib<0.17.0,>=0.7.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from weasel<0.4.0,>=0.1.0->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (0.16.0)\n",
|
|
"Requirement already satisfied: MarkupSafe>=2.0 in /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages (from jinja2->spacy<3.8.0,>=3.7.0->en-core-web-md==3.7.0) (2.1.3)\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Installing collected packages: en-core-web-md\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Successfully installed en-core-web-md-3.7.0\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n",
|
|
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3\u001b[0m\n",
|
|
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
|
|
"You can now load the package via spacy.load('en_core_web_md')\n"
|
|
]
|
|
},
|
|
{
|
|
"ename": "FileNotFoundError",
|
|
"evalue": "[Errno 2] No such file or directory: '102141_2_eng'",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[6], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m key \u001b[38;5;129;01min\u001b[39;00m mydict:\n\u001b[0;32m----> 2\u001b[0m mydict[key] \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mTextDetector\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m[\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43manalyse_text\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\n\u001b[1;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43manalyse_image\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
|
|
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:158\u001b[0m, in \u001b[0;36mTextDetector.analyse_image\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 152\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21manalyse_image\u001b[39m(\u001b[38;5;28mself\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mdict\u001b[39m:\n\u001b[1;32m 153\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Perform text extraction and analysis of the text.\u001b[39;00m\n\u001b[1;32m 154\u001b[0m \n\u001b[1;32m 155\u001b[0m \u001b[38;5;124;03m Returns:\u001b[39;00m\n\u001b[1;32m 156\u001b[0m \u001b[38;5;124;03m dict: The updated dictionary with text analysis results.\u001b[39;00m\n\u001b[1;32m 157\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 158\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_from_image\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 159\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtranslate_text()\n\u001b[1;32m 160\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mremove_linebreaks()\n",
|
|
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:178\u001b[0m, in \u001b[0;36mTextDetector.get_text_from_image\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 174\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m DefaultCredentialsError:\n\u001b[1;32m 175\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m DefaultCredentialsError(\n\u001b[1;32m 176\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease provide credentials for google cloud vision API, see https://cloud.google.com/docs/authentication/application-default-credentials.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 177\u001b[0m )\n\u001b[0;32m--> 178\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[43mio\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mopen\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mrb\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m \u001b[38;5;28;01mas\u001b[39;00m image_file:\n\u001b[1;32m 179\u001b[0m content \u001b[38;5;241m=\u001b[39m image_file\u001b[38;5;241m.\u001b[39mread()\n\u001b[1;32m 180\u001b[0m image \u001b[38;5;241m=\u001b[39m vision\u001b[38;5;241m.\u001b[39mImage(content\u001b[38;5;241m=\u001b[39mcontent)\n",
|
|
"\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '102141_2_eng'"
|
|
]
|
|
}
|
|
],
"source": [
"for key in mydict:\n",
"    mydict[key] = ammico.text.TextDetector(\n",
"        mydict[key], analyse_text=True\n",
"    ).analyse_image()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3c063eda",
"metadata": {},
"source": [
"## Convert to dataframe and write csv\n",
"These steps are required to convert the dictionary of dictionaries into a dictionary of lists, which can then be converted into a pandas DataFrame and exported to a CSV file."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5709c2cd",
"metadata": {
"execution": {
"iopub.execute_input": "2023-10-20T12:21:12.545579Z",
"iopub.status.busy": "2023-10-20T12:21:12.544933Z",
"iopub.status.idle": "2023-10-20T12:21:13.263838Z",
"shell.execute_reply": "2023-10-20T12:21:13.261908Z"
}
},
"outputs": [
{
|
|
"ename": "ValueError",
|
|
"evalue": "All arrays must be of the same length",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[7], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m outdict \u001b[38;5;241m=\u001b[39m mutils\u001b[38;5;241m.\u001b[39mappend_data_to_dict(mydict)\n\u001b[0;32m----> 2\u001b[0m df \u001b[38;5;241m=\u001b[39m \u001b[43mmutils\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdump_df\u001b[49m\u001b[43m(\u001b[49m\u001b[43moutdict\u001b[49m\u001b[43m)\u001b[49m\n",
|
|
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/utils.py:222\u001b[0m, in \u001b[0;36mdump_df\u001b[0;34m(mydict)\u001b[0m\n\u001b[1;32m 220\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mdump_df\u001b[39m(mydict: \u001b[38;5;28mdict\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m DataFrame:\n\u001b[1;32m 221\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Utility to dump the dictionary into a dataframe.\"\"\"\u001b[39;00m\n\u001b[0;32m--> 222\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mDataFrame\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m)\u001b[49m\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/frame.py:1816\u001b[0m, in \u001b[0;36mDataFrame.from_dict\u001b[0;34m(cls, data, orient, dtype, columns)\u001b[0m\n\u001b[1;32m 1810\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 1811\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mExpected \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m, \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m or \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mtight\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m for orient parameter. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1812\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mGot \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00morient\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m instead\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1813\u001b[0m )\n\u001b[1;32m 1815\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m orient \u001b[38;5;241m!=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtight\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[0;32m-> 1816\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mcls\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1817\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 1818\u001b[0m 
realdata \u001b[38;5;241m=\u001b[39m data[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdata\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/frame.py:736\u001b[0m, in \u001b[0;36mDataFrame.__init__\u001b[0;34m(self, data, index, columns, dtype, copy)\u001b[0m\n\u001b[1;32m 730\u001b[0m mgr \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_init_mgr(\n\u001b[1;32m 731\u001b[0m data, axes\u001b[38;5;241m=\u001b[39m{\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m\"\u001b[39m: index, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m\"\u001b[39m: columns}, dtype\u001b[38;5;241m=\u001b[39mdtype, copy\u001b[38;5;241m=\u001b[39mcopy\n\u001b[1;32m 732\u001b[0m )\n\u001b[1;32m 734\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, \u001b[38;5;28mdict\u001b[39m):\n\u001b[1;32m 735\u001b[0m \u001b[38;5;66;03m# GH#38939 de facto copy defaults to False only in non-dict cases\u001b[39;00m\n\u001b[0;32m--> 736\u001b[0m mgr \u001b[38;5;241m=\u001b[39m \u001b[43mdict_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmanager\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 737\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, ma\u001b[38;5;241m.\u001b[39mMaskedArray):\n\u001b[1;32m 738\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mnumpy\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mma\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m 
mrecords\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:503\u001b[0m, in \u001b[0;36mdict_to_mgr\u001b[0;34m(data, index, columns, dtype, typ, copy)\u001b[0m\n\u001b[1;32m 499\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 500\u001b[0m \u001b[38;5;66;03m# dtype check to exclude e.g. range objects, scalars\u001b[39;00m\n\u001b[1;32m 501\u001b[0m arrays \u001b[38;5;241m=\u001b[39m [x\u001b[38;5;241m.\u001b[39mcopy() \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(x, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdtype\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m x \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m arrays]\n\u001b[0;32m--> 503\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43marrays_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtyp\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mconsolidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m)\u001b[49m\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:114\u001b[0m, in \u001b[0;36marrays_to_mgr\u001b[0;34m(arrays, columns, index, dtype, verify_integrity, typ, consolidate)\u001b[0m\n\u001b[1;32m 111\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m verify_integrity:\n\u001b[1;32m 112\u001b[0m \u001b[38;5;66;03m# figure out the index, if necessary\u001b[39;00m\n\u001b[1;32m 113\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m index \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m--> 114\u001b[0m index \u001b[38;5;241m=\u001b[39m \u001b[43m_extract_index\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 115\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 116\u001b[0m index \u001b[38;5;241m=\u001b[39m ensure_index(index)\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/internals/construction.py:677\u001b[0m, in \u001b[0;36m_extract_index\u001b[0;34m(data)\u001b[0m\n\u001b[1;32m 675\u001b[0m lengths \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mset\u001b[39m(raw_lengths))\n\u001b[1;32m 676\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(lengths) \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[0;32m--> 677\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAll arrays must be of the same length\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 679\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m have_dicts:\n\u001b[1;32m 680\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 681\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMixing dicts with non-Series may lead to ambiguous ordering.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 682\u001b[0m )\n",
|
|
"\u001b[0;31mValueError\u001b[0m: All arrays must be of the same length"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"outdict = mutils.append_data_to_dict(mydict)\n",
|
|
"df = mutils.dump_df(outdict)"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "ae182eb7",
|
|
"metadata": {},
|
|
"source": [
|
|
"Check the dataframe:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "c4f05637",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-10-20T12:21:13.268530Z",
|
|
"iopub.status.busy": "2023-10-20T12:21:13.267995Z",
|
|
"iopub.status.idle": "2023-10-20T12:21:13.311720Z",
|
|
"shell.execute_reply": "2023-10-20T12:21:13.310827Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"ename": "NameError",
|
|
"evalue": "name 'df' is not defined",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[8], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mdf\u001b[49m\u001b[38;5;241m.\u001b[39mhead(\u001b[38;5;241m10\u001b[39m)\n",
|
|
"\u001b[0;31mNameError\u001b[0m: name 'df' is not defined"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"df.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "eedf1e47",
|
|
"metadata": {},
|
|
"source": [
|
|
"Write the csv file - here you should provide a file path and file name for the csv file to be written."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"id": "bf6c9ddb",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-10-20T12:21:13.316399Z",
|
|
"iopub.status.busy": "2023-10-20T12:21:13.315650Z",
|
|
"iopub.status.idle": "2023-10-20T12:21:13.361379Z",
|
|
"shell.execute_reply": "2023-10-20T12:21:13.360605Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"ename": "NameError",
|
|
"evalue": "name 'df' is not defined",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[9], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Write the csv\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mdf\u001b[49m\u001b[38;5;241m.\u001b[39mto_csv(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m./data_out.csv\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
|
|
"\u001b[0;31mNameError\u001b[0m: name 'df' is not defined"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Write the csv\n",
|
|
"df.to_csv(\"./data_out.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "4bc8ac0a",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Topic analysis\n",
|
|
"The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) using an embedded model through a [spaCy](https://spacy.io/) pipeline."
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "4931941b",
|
|
"metadata": {},
|
|
"source": [
|
|
"BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If the below returns an error for `analyse_topic()`, the reason can be that your dataset is too small.\n",
|
|
"\n",
|
|
"You can pass which dataframe entry you would like to have analyzed. The default is `text_english`, but you could for example also select `text_summary` or `text_english_correct` setting the keyword `analyze_text` as so:\n",
|
|
"\n",
|
|
"`ammico.text.PostprocessText(mydict=mydict, analyze_text=\"text_summary\").analyse_topic()`\n",
|
|
"\n",
|
|
"### Option 1: Use the dictionary as obtained from the above analysis."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"id": "a3450a61",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-10-20T12:21:13.366656Z",
|
|
"iopub.status.busy": "2023-10-20T12:21:13.366030Z",
|
|
"iopub.status.idle": "2023-10-20T12:21:13.434827Z",
|
|
"shell.execute_reply": "2023-10-20T12:21:13.434058Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Reading data from dict.\n"
|
|
]
|
|
},
|
|
{
|
|
"ename": "ValueError",
|
|
"evalue": "Please check your provided dictionary - no text_english text data found.",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[10], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# make a list of all the text_english entries per analysed image from the mydict variable as above\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m topic_model, topic_df, most_frequent_topics \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPostprocessText\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mmydict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmydict\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39manalyse_topic()\n",
|
|
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:303\u001b[0m, in \u001b[0;36mPostprocessText.__init__\u001b[0;34m(self, mydict, use_csv, csv_path, analyze_text)\u001b[0m\n\u001b[1;32m 301\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from dict.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 302\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict \u001b[38;5;241m=\u001b[39m mydict\n\u001b[0;32m--> 303\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlist_text_english \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43manalyze_text\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 304\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39muse_csv:\n\u001b[1;32m 305\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from df.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
|
|
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:375\u001b[0m, in \u001b[0;36mPostprocessText.get_text_dict\u001b[0;34m(self, analyze_text)\u001b[0m\n\u001b[1;32m 373\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m key \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict\u001b[38;5;241m.\u001b[39mkeys():\n\u001b[1;32m 374\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m analyze_text \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict[key]:\n\u001b[0;32m--> 375\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 376\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease check your provided dictionary - \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 377\u001b[0m \u001b[38;5;124m no \u001b[39m\u001b[38;5;132;01m{}\u001b[39;00m\u001b[38;5;124m text data found.\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mformat(\n\u001b[1;32m 378\u001b[0m analyze_text\n\u001b[1;32m 379\u001b[0m )\n\u001b[1;32m 380\u001b[0m )\n\u001b[1;32m 381\u001b[0m list_text_english\u001b[38;5;241m.\u001b[39mappend(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmydict[key][analyze_text])\n\u001b[1;32m 382\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m list_text_english\n",
|
|
"\u001b[0;31mValueError\u001b[0m: Please check your provided dictionary - no text_english text data found."
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
|
|
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
|
|
" mydict=mydict\n",
|
|
").analyse_topic()"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "95667342",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Option 2: Read in a csv\n",
|
|
"Not to analyse too many images on google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"id": "5530e436",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-10-20T12:21:13.439138Z",
|
|
"iopub.status.busy": "2023-10-20T12:21:13.437740Z",
|
|
"iopub.status.idle": "2023-10-20T12:21:13.505594Z",
|
|
"shell.execute_reply": "2023-10-20T12:21:13.504797Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Reading data from df.\n"
|
|
]
|
|
},
|
|
{
|
|
"ename": "ValueError",
|
|
"evalue": "Please check your provided dataframe - no text_english text data found.",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[11], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m input_file_path \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdata_out.csv\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m----> 2\u001b[0m topic_model, topic_df, most_frequent_topics \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPostprocessText\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_csv\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcsv_path\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minput_file_path\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39manalyse_topic(return_topics\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m10\u001b[39m)\n",
|
|
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:307\u001b[0m, in \u001b[0;36mPostprocessText.__init__\u001b[0;34m(self, mydict, use_csv, csv_path, analyze_text)\u001b[0m\n\u001b[1;32m 305\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReading data from df.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 306\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(csv_path, encoding\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mutf8\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m--> 307\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlist_text_english \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_text_df\u001b[49m\u001b[43m(\u001b[49m\u001b[43manalyze_text\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 308\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 309\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 310\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease provide either dictionary with textual data or \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 311\u001b[0m \u001b[38;5;124m a csv file by setting `use_csv` to True and providing a \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 312\u001b[0m \u001b[38;5;124m `csv_path`.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 313\u001b[0m )\n",
|
|
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:397\u001b[0m, in \u001b[0;36mPostprocessText.get_text_df\u001b[0;34m(self, analyze_text)\u001b[0m\n\u001b[1;32m 394\u001b[0m \u001b[38;5;66;03m# use csv file to obtain dataframe and put text_english or text_summary in list\u001b[39;00m\n\u001b[1;32m 395\u001b[0m \u001b[38;5;66;03m# check that \"text_english\" or \"text_summary\" is there\u001b[39;00m\n\u001b[1;32m 396\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m analyze_text \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf:\n\u001b[0;32m--> 397\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 398\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlease check your provided dataframe - \u001b[39m\u001b[38;5;130;01m\\\u001b[39;00m\n\u001b[1;32m 399\u001b[0m \u001b[38;5;124m no \u001b[39m\u001b[38;5;132;01m{}\u001b[39;00m\u001b[38;5;124m text data found.\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mformat(\n\u001b[1;32m 400\u001b[0m analyze_text\n\u001b[1;32m 401\u001b[0m )\n\u001b[1;32m 402\u001b[0m )\n\u001b[1;32m 403\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdf[analyze_text]\u001b[38;5;241m.\u001b[39mtolist()\n",
|
|
"\u001b[0;31mValueError\u001b[0m: Please check your provided dataframe - no text_english text data found."
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"input_file_path = \"data_out.csv\"\n",
|
|
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
|
|
" use_csv=True, csv_path=input_file_path\n",
|
|
").analyse_topic(return_topics=10)"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "0b6ef6d7",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Access frequent topics\n",
|
|
"A topic of `-1` stands for an outlier and should be ignored. Topic count is the number of occurence of that topic. The output is structured from most frequent to least frequent topic."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-10-20T12:21:13.509125Z",
|
|
"iopub.status.busy": "2023-10-20T12:21:13.508664Z",
|
|
"iopub.status.idle": "2023-10-20T12:21:13.551069Z",
|
|
"shell.execute_reply": "2023-10-20T12:21:13.550365Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"ename": "NameError",
|
|
"evalue": "name 'topic_df' is not defined",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[12], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[43mtopic_df\u001b[49m)\n",
|
|
"\u001b[0;31mNameError\u001b[0m: name 'topic_df' is not defined"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print(topic_df)"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "b3316770",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Get information for specific topic\n",
|
|
"The most frequent topics can be accessed through `most_frequent_topics` with the most occuring topics first in the list."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"id": "db14fe03",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-10-20T12:21:13.554591Z",
|
|
"iopub.status.busy": "2023-10-20T12:21:13.553971Z",
|
|
"iopub.status.idle": "2023-10-20T12:21:13.595171Z",
|
|
"shell.execute_reply": "2023-10-20T12:21:13.594404Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"ename": "NameError",
|
|
"evalue": "name 'most_frequent_topics' is not defined",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[13], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m topic \u001b[38;5;129;01min\u001b[39;00m \u001b[43mmost_frequent_topics\u001b[49m:\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTopic:\u001b[39m\u001b[38;5;124m\"\u001b[39m, topic)\n",
|
|
"\u001b[0;31mNameError\u001b[0m: name 'most_frequent_topics' is not defined"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"for topic in most_frequent_topics:\n",
|
|
" print(\"Topic:\", topic)"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "d10f701e",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Topic visualization\n",
|
|
"The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"id": "2331afe6",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-10-20T12:21:13.598899Z",
|
|
"iopub.status.busy": "2023-10-20T12:21:13.598199Z",
|
|
"iopub.status.idle": "2023-10-20T12:21:13.637533Z",
|
|
"shell.execute_reply": "2023-10-20T12:21:13.636790Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"ename": "NameError",
|
|
"evalue": "name 'topic_model' is not defined",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[14], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtopic_model\u001b[49m\u001b[38;5;241m.\u001b[39mvisualize_topics()\n",
|
|
"\u001b[0;31mNameError\u001b[0m: name 'topic_model' is not defined"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"topic_model.visualize_topics()"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "f4eaf353",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Save the model\n",
|
|
"The model can be saved for future use."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"id": "e5e8377c",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-10-20T12:21:13.641364Z",
|
|
"iopub.status.busy": "2023-10-20T12:21:13.640744Z",
|
|
"iopub.status.idle": "2023-10-20T12:21:13.680230Z",
|
|
"shell.execute_reply": "2023-10-20T12:21:13.679510Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"ename": "NameError",
|
|
"evalue": "name 'topic_model' is not defined",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[15], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtopic_model\u001b[49m\u001b[38;5;241m.\u001b[39msave(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmisinfo_posts\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
|
|
"\u001b[0;31mNameError\u001b[0m: name 'topic_model' is not defined"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"topic_model.save(\"misinfo_posts\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7c94edb9",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.18"
|
|
},
|
|
"vscode": {
|
|
"interpreter": {
|
|
"hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
|
|
}
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|