{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dcaa3da1",
   "metadata": {},
   "source": [
    "# Notebook for text extraction on images\n",
    "Inga Ulusoy, SSC, July 2022"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f43f327c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# if running on google colab\n",
    "# flake8-noqa-cell\n",
    "import os\n",
    "\n",
    "if \"google.colab\" in str(get_ipython()):\n",
    "    # update python version\n",
    "    # install setuptools\n",
    "    !pip install setuptools==61 -qqq\n",
    "    # install misinformation\n",
    "    !pip install git+https://github.com/ssciwr/misinformation.git -qqq\n",
    "    # mount google drive for data and API key\n",
    "    from google.colab import drive\n",
    "\n",
    "    drive.mount(\"/content/drive\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf362e60",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from IPython.display import Image, display\n",
    "import misinformation\n",
    "import tensorflow as tf\n",
    "\n",
    "print(tf.config.list_physical_devices(\"GPU\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27675810",
   "metadata": {},
   "outputs": [],
   "source": [
    "# download the models if they are not there yet\n",
    "!python -m spacy download en_core_web_md\n",
    "!python -m textblob.download_corpora"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6da3a7aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "images = misinformation.find_files(path=\"../data/all/\", limit=1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf811ce0",
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in images[0:3]:\n",
    "    display(Image(filename=i))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b32409f",
   "metadata": {},
   "outputs": [],
   "source": [
    "mydict = misinformation.utils.initialize_dict(images[0:3])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b8b929f",
   "metadata": {},
   "source": [
    "# google cloud vision API\n",
    "The first 1000 images per month are free."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986",
   "metadata": {},
   "outputs": [],
   "source": [
    "# point GOOGLE_APPLICATION_CREDENTIALS to your google cloud service account key file\n",
    "os.environ[\n",
    "    \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
    "] = \"../data/misinformation-campaign-981aa55a3b13.json\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0891b795-c7fe-454c-a45d-45fadf788142",
   "metadata": {},
   "source": [
    "## Inspect the elements per image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c6ecc88",
   "metadata": {},
   "outputs": [],
   "source": [
    "misinformation.explore_analysis(mydict, identify=\"text-on-image\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52",
   "metadata": {},
   "source": [
    "## Or directly analyze for further processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f",
   "metadata": {},
   "outputs": [],
   "source": [
    "for key in mydict:\n",
    "    print(key)\n",
    "    mydict[key] = misinformation.text.TextDetector(\n",
    "        mydict[key], analyse_text=True\n",
    "    ).analyse_image()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c063eda",
   "metadata": {},
   "source": [
    "## Convert to dataframe and write csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5709c2cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "outdict = misinformation.utils.append_data_to_dict(mydict)\n",
    "df = misinformation.utils.dump_df(outdict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4f05637",
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the dataframe\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf6c9ddb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# write the csv\n",
    "df.to_csv(\"./data_out.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bc8ac0a",
   "metadata": {},
   "source": [
    "# Topic analysis\n",
    "The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) with an embedding model through a [spaCy](https://spacy.io/) pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4931941b",
   "metadata": {},
   "source": [
    "BERTopic takes a list of strings as input. The more items in the list, the better the topic modeling works. If `analyse_topic()` below returns an error, the reason may be that your dataset is too small.\n",
    "### Option 1: Use the dictionary as obtained from the above analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3450a61",
   "metadata": {},
   "outputs": [],
   "source": [
    "# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
    "topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n",
    "    mydict=mydict\n",
    ").analyse_topic()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95667342",
   "metadata": {},
   "source": [
    "### Option 2: Read in a csv\n",
    "To avoid sending too many images to Google Cloud Vision, use the csv output to obtain the text when rerunning images that have already been analysed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5530e436",
   "metadata": {},
   "outputs": [],
   "source": [
    "input_file_path = \"data_out.csv\"\n",
    "topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n",
    "    use_csv=True, csv_path=input_file_path\n",
    ").analyse_topic(return_topics=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b6ef6d7",
   "metadata": {},
   "source": [
    "### Access frequent topics\n",
    "A topic of `-1` stands for outliers and should be ignored. The topic count is the number of occurrences of that topic. The output is ordered from most frequent to least frequent topic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(topic_df)"
   ]
  },
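  {
   "cell_type": "markdown",
   "id": "filter-outlier-topics-md",
   "metadata": {},
   "source": [
    "If you only want the genuine topics, you could drop the outlier row from the dataframe. This is a minimal sketch, assuming `topic_df` stores the topic id in a `Topic` column as in BERTopic's `get_topic_info()` output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "filter-outlier-topics",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sketch: drop the outlier topic (-1)\n",
    "# assumes topic_df has a \"Topic\" column as in BERTopic's get_topic_info() output\n",
    "topic_df[topic_df[\"Topic\"] != -1]"
   ]
  },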
  {
   "cell_type": "markdown",
   "id": "b3316770",
   "metadata": {},
   "source": [
    "### Get information for a specific topic\n",
    "The most frequent topics can be accessed through `most_frequent_topics`, with the most frequently occurring topics first in the list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db14fe03",
   "metadata": {},
   "outputs": [],
   "source": [
    "for topic in most_frequent_topics:\n",
    "    print(\"Topic:\", topic)"
   ]
  },
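  {
   "cell_type": "markdown",
   "id": "inspect-topic-words-md",
   "metadata": {},
   "source": [
    "The underlying model can also be queried for the top words of a single topic. A minimal sketch, assuming `topic_model` is a fitted BERTopic instance and that topic `0` exists in your data; `get_topic()` returns the top words of that topic together with their scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "inspect-topic-words",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sketch: inspect the top words of topic 0\n",
    "# assumes topic_model is a fitted BERTopic instance and topic 0 exists\n",
    "topic_model.get_topic(0)"
   ]
  },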
  {
   "cell_type": "markdown",
   "id": "d10f701e",
   "metadata": {},
   "source": [
    "### Topic visualization\n",
    "The topics can also be visualized. Careful: this only works if there is sufficient data (quantity and quality)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2331afe6",
   "metadata": {},
   "outputs": [],
   "source": [
    "topic_model.visualize_topics()"
   ]
  },
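  {
   "cell_type": "markdown",
   "id": "visualize-barchart-md",
   "metadata": {},
   "source": [
    "As an alternative view, a bar chart of the top words per topic may also be available; a sketch, assuming the installed BERTopic version provides `visualize_barchart()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "visualize-barchart",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sketch: bar chart of the top words for the most frequent topics\n",
    "# assumes visualize_barchart() is available in the installed BERTopic version\n",
    "topic_model.visualize_barchart()"
   ]
  },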
  {
   "cell_type": "markdown",
   "id": "f4eaf353",
   "metadata": {},
   "source": [
    "### Save the model\n",
    "The model can be saved for future use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5e8377c",
   "metadata": {},
   "outputs": [],
   "source": [
    "topic_model.save(\"misinfo_posts\")"
   ]
  },
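  {
   "cell_type": "markdown",
   "id": "load-topic-model-md",
   "metadata": {},
   "source": [
    "To reuse the saved model in a later session, it can be loaded back in; a minimal sketch, assuming the standard `BERTopic.load()` API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "load-topic-model",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sketch: reload the saved topic model in a later session\n",
    "# assumes the standard BERTopic.load() API\n",
    "from bertopic import BERTopic\n",
    "\n",
    "loaded_model = BERTopic.load(\"misinfo_posts\")"
   ]
  },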
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c94edb9",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  },
  "vscode": {
   "interpreter": {
    "hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}