зеркало из
				https://github.com/ssciwr/AMMICO.git
				synced 2025-10-30 21:46:04 +02:00 
			
		
		
		
	![pre-commit-ci[bot]](/assets/img/avatar_default.png) fcb2d55740
			
		
	
	
		fcb2d55740
		
			
		
	
	
	
	
		
			
			* [pre-commit.ci] pre-commit autoupdate updates: - [github.com/kynan/nbstripout: 0.6.1 → 0.7.1](https://github.com/kynan/nbstripout/compare/0.6.1...0.7.1) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
		
			
				
	
	
		
			372 строки
		
	
	
		
			11 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			372 строки
		
	
	
		
			11 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| {
 | |
|  "cells": [
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "0",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "# Notebook for text extraction on image\n",
 | |
|     "\n",
 | |
|     "The text extraction and analysis is carried out using a variety of tools:  \n",
 | |
|     "\n",
 | |
|     "1. Text extraction from the image using [google-cloud-vision](https://cloud.google.com/vision)  \n",
 | |
|     "1. Language detection of the extracted text using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/)  \n",
 | |
|     "1. Translation into English or other languages using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
 | |
|     "1. Cleaning of the text using [spacy](https://spacy.io/) \n",
 | |
|     "1. Spell-check using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
 | |
|     "1. Subjectivity analysis using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
 | |
|     "1. Text summarization using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n",
 | |
|     "1. Sentiment analysis using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
 | |
|     "1. Named entity recognition using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
 | |
|     "1. Topic analysis using [BERTopic](https://github.com/MaartenGr/BERTopic) \n",
 | |
|     "\n",
 | |
|     "The first cell is only run on google colab and installs the [ammico](https://github.com/ssciwr/AMMICO) package.\n",
 | |
|     "\n",
 | |
|     "After that, we can import `ammico` and read in the files given a folder path."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "1",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# if running on google colab\n",
 | |
|     "# flake8-noqa-cell\n",
 | |
|     "import os\n",
 | |
|     "\n",
 | |
|     "if \"google.colab\" in str(get_ipython()):\n",
 | |
|     "    # update python version\n",
 | |
|     "    # install setuptools\n",
 | |
|     "    # %pip install setuptools==61 -qqq\n",
 | |
|     "    # uninstall some pre-installed packages due to incompatibility\n",
 | |
|     "    %pip uninstall tensorflow-probability dopamine-rl lida pandas-gbq torchaudio torchdata torchtext orbax-checkpoint -y -qqq\n",
 | |
|     "    # install ammico\n",
 | |
|     "    %pip install git+https://github.com/ssciwr/ammico.git -qqq\n",
 | |
|     "    # mount google drive for data and API key\n",
 | |
|     "    from google.colab import drive\n",
 | |
|     "\n",
 | |
|     "    drive.mount(\"/content/drive\")"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "2",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "import os\n",
 | |
|     "import ammico"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "3",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "We select a subset of image files to try the text extraction on, see the `limit` keyword. The `find_files` function finds image files within a given directory and initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis: "
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "4",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# Here you need to provide the path to your google drive folder\n",
 | |
|     "# or local folder containing the images\n",
 | |
|     "mydict = ammico.find_files(\n",
 | |
|     "    path=\"/content/drive/MyDrive/misinformation-data/\",\n",
 | |
|     "    limit=10,\n",
 | |
|     ")\n",
 | |
|     "mydict"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "5",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "# Google cloud vision API\n",
 | |
|     "\n",
 | |
|     "For this you need an API key and have the app activated in your google console. The first 1000 images per month are free (July 2022)."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "6",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "os.environ[\n",
 | |
|     "    \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
 | |
|     "] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\""
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "7",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Inspect the elements per image\n",
 | |
|     "To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below.\n",
 | |
|     "Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the `port` number if you are already running several notebook instances on the same server."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "8",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "analysis_explorer = ammico.AnalysisExplorer(mydict)\n",
 | |
|     "analysis_explorer.run_server(port=8054)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "9",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Or directly analyze for further processing\n",
 | |
|     "Instead of inspecting each of the images, you can also directly carry out the analysis and export the result into a csv. This may take a while depending on how many images you have loaded. Set the keyword `analyse_text` to `True` if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER)."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "10",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "for key in mydict:\n",
 | |
|     "    mydict[key] = ammico.TextDetector(\n",
 | |
|     "        mydict[key], analyse_text=True\n",
 | |
|     "    ).analyse_image()"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "11",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Convert to dataframe and write csv\n",
 | |
|     "These steps are required to convert the dictionary of dictionarys into a dictionary with lists, that can be converted into a pandas dataframe and exported to a csv file."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "12",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "df = ammico.get_dataframe(mydict)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "13",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "Check the dataframe:"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "14",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "df.head(10)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "15",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "Write the csv file - here you should provide a file path and file name for the csv file to be written."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "16",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# Write the csv\n",
 | |
|     "df.to_csv(\"/content/drive/MyDrive/misinformation-data/data_out.csv\")"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "17",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "# Topic analysis\n",
 | |
|     "The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) using an embedded model through a [spaCy](https://spacy.io/) pipeline."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "18",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If the below returns an error for `analyse_topic()`, the reason can be that your dataset is too small.\n",
 | |
|     "\n",
 | |
|     "You can pass which dataframe entry you would like to have analyzed. The default is `text_english`, but you could for example also select `text_summary` or `text_english_correct` setting the keyword `analyze_text` as so:\n",
 | |
|     "\n",
 | |
|     "`ammico.text.PostprocessText(mydict=mydict, analyze_text=\"text_summary\").analyse_topic()`\n",
 | |
|     "\n",
 | |
|     "### Option 1: Use the dictionary as obtained from the above analysis."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "19",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
 | |
|     "topic_model, topic_df, most_frequent_topics = ammico.PostprocessText(\n",
 | |
|     "    mydict=mydict\n",
 | |
|     ").analyse_topic()"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "20",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Option 2: Read in a csv\n",
 | |
|     "Not to analyse too many images on google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "21",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "input_file_path = \"/content/drive/MyDrive/misinformation-data/data_out.csv\"\n",
 | |
|     "topic_model, topic_df, most_frequent_topics = ammico.PostprocessText(\n",
 | |
|     "    use_csv=True, csv_path=input_file_path\n",
 | |
|     ").analyse_topic(return_topics=10)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "22",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Access frequent topics\n",
 | |
|     "A topic of `-1` stands for an outlier and should be ignored. Topic count is the number of occurence of that topic. The output is structured from most frequent to least frequent topic."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "23",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "print(topic_df)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "24",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Get information for specific topic\n",
 | |
|     "The most frequent topics can be accessed through `most_frequent_topics` with the most occuring topics first in the list."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "25",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "for topic in most_frequent_topics:\n",
 | |
|     "    print(\"Topic:\", topic)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "26",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Topic visualization\n",
 | |
|     "The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "27",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "topic_model.visualize_topics()"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "28",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Save the model\n",
 | |
|     "The model can be saved for future use."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "29",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "topic_model.save(\"misinfo_posts\")"
 | |
|    ]
 | |
|   }
 | |
|  ],
 | |
|  "metadata": {
 | |
|   "kernelspec": {
 | |
|    "display_name": "Python 3 (ipykernel)",
 | |
|    "language": "python",
 | |
|    "name": "python3"
 | |
|   },
 | |
|   "language_info": {
 | |
|    "codemirror_mode": {
 | |
|     "name": "ipython",
 | |
|     "version": 3
 | |
|    },
 | |
|    "file_extension": ".py",
 | |
|    "mimetype": "text/x-python",
 | |
|    "name": "python",
 | |
|    "nbconvert_exporter": "python",
 | |
|    "pygments_lexer": "ipython3",
 | |
|    "version": "3.9.16"
 | |
|   },
 | |
|   "vscode": {
 | |
|    "interpreter": {
 | |
|     "hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
 | |
|    }
 | |
|   }
 | |
|  },
 | |
|  "nbformat": 4,
 | |
|  "nbformat_minor": 5
 | |
| }
 |