зеркало из
https://github.com/ssciwr/AMMICO.git
synced 2025-10-30 13:36:04 +02:00
* use pip with ipython magic and not terminal * loosen setuptools version for colab notebooks * update install instructions
398 строки
12 KiB
Plaintext
Generated
398 строки
12 KiB
Plaintext
Generated
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "dcaa3da1",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Notebook for text extraction on image\n",
|
|
"\n",
|
|
"The text extraction and analysis is carried out using a variety of tools: \n",
|
|
"\n",
|
|
"1. Text extraction from the image using [google-cloud-vision](https://cloud.google.com/vision) \n",
|
|
"1. Language detection of the extracted text using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
|
|
"1. Translation into English or other languages using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
|
|
"1. Cleaning of the text using [spacy](https://spacy.io/) \n",
|
|
"1. Spell-check using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
|
|
"1. Subjectivity analysis using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
|
|
"1. Text summarization using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n",
|
|
"1. Sentiment analysis using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
|
|
"1. Named entity recognition using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
|
|
"1. Topic analysis using [BERTopic](https://github.com/MaartenGr/BERTopic) \n",
|
|
"\n",
|
|
"The first cell is only run on google colab and installs the [ammico](https://github.com/ssciwr/AMMICO) package.\n",
|
|
"\n",
|
|
"After that, we can import `ammico` and read in the files given a folder path."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "f43f327c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# if running on google colab\n",
|
|
"# flake8-noqa-cell\n",
|
|
"import os\n",
|
|
"\n",
|
|
"if \"google.colab\" in str(get_ipython()):\n",
|
|
" # update python version\n",
|
|
" # install setuptools\n",
|
|
" # %pip install setuptools==61 -qqq\n",
|
|
" # install ammico\n",
|
|
" %pip install git+https://github.com/ssciwr/ammico.git -qqq\n",
|
|
" # mount google drive for data and API key\n",
|
|
" from google.colab import drive\n",
|
|
"\n",
|
|
" drive.mount(\"/content/drive\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "cf362e60",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import os\n",
|
|
"import ammico\n",
|
|
"from ammico import utils as mutils\n",
|
|
"from ammico import display as mdisplay"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "fddba721",
|
|
"metadata": {},
|
|
"source": [
|
|
"We select a subset of image files to try the text extraction on, see the `limit` keyword. The `find_files` function finds image files within a given directory: "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "27675810",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Here you need to provide the path to your google drive folder\n",
|
|
"# or local folder containing the images\n",
|
|
"images = mutils.find_files(\n",
|
|
" path=\"/content/drive/MyDrive/misinformation-data/\",\n",
|
|
" limit=10,\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3a7dfe11",
|
|
"metadata": {},
|
|
"source": [
|
|
"We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "8b32409f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"mydict = mutils.initialize_dict(images)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7b8b929f",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Google cloud vision API\n",
|
|
"\n",
|
|
"For this you need an API key and have the app activated in your google console. The first 1000 images per month are free (July 2022)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"os.environ[\n",
|
|
" \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
|
|
"] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0891b795-c7fe-454c-a45d-45fadf788142",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Inspect the elements per image\n",
|
|
"To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below.\n",
|
|
"Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the `port` number if you are already running several notebook instances on the same server."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7c6ecc88",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"analysis_explorer = mdisplay.AnalysisExplorer(mydict, identify=\"text-on-image\")\n",
|
|
"analysis_explorer.run_server(port=8054)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Or directly analyze for further processing\n",
|
|
"Instead of inspecting each of the images, you can also directly carry out the analysis and export the result into a csv. This may take a while depending on how many images you have loaded. Set the keyword `analyse_text` to `True` if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for key in mydict:\n",
|
|
" mydict[key] = ammico.text.TextDetector(\n",
|
|
" mydict[key], analyse_text=True\n",
|
|
" ).analyse_image()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3c063eda",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Convert to dataframe and write csv\n",
|
|
"These steps are required to convert the dictionary of dictionarys into a dictionary with lists, that can be converted into a pandas dataframe and exported to a csv file."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "5709c2cd",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"outdict = mutils.append_data_to_dict(mydict)\n",
|
|
"df = mutils.dump_df(outdict)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ae182eb7",
|
|
"metadata": {},
|
|
"source": [
|
|
"Check the dataframe:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "c4f05637",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "eedf1e47",
|
|
"metadata": {},
|
|
"source": [
|
|
"Write the csv file - here you should provide a file path and file name for the csv file to be written."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "bf6c9ddb",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Write the csv\n",
|
|
"df.to_csv(\"/content/drive/MyDrive/misinformation-data/data_out.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "4bc8ac0a",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Topic analysis\n",
|
|
"The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) using an embedded model through a [spaCy](https://spacy.io/) pipeline."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "4931941b",
|
|
"metadata": {},
|
|
"source": [
|
|
"BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If the below returns an error for `analyse_topic()`, the reason can be that your dataset is too small.\n",
|
|
"\n",
|
|
"You can pass which dataframe entry you would like to have analyzed. The default is `text_english`, but you could for example also select `text_summary` or `text_english_correct` setting the keyword `analyze_text` as so:\n",
|
|
"\n",
|
|
"`ammico.text.PostprocessText(mydict=mydict, analyze_text=\"text_summary\").analyse_topic()`\n",
|
|
"\n",
|
|
"### Option 1: Use the dictionary as obtained from the above analysis."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "a3450a61",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
|
|
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
|
|
" mydict=mydict\n",
|
|
").analyse_topic()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "95667342",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Option 2: Read in a csv\n",
|
|
"Not to analyse too many images on google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "5530e436",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"input_file_path = \"/content/drive/MyDrive/misinformation-data/data_out.csv\"\n",
|
|
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
|
|
" use_csv=True, csv_path=input_file_path\n",
|
|
").analyse_topic(return_topics=10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0b6ef6d7",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Access frequent topics\n",
|
|
"A topic of `-1` stands for an outlier and should be ignored. Topic count is the number of occurence of that topic. The output is structured from most frequent to least frequent topic."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(topic_df)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b3316770",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Get information for specific topic\n",
|
|
"The most frequent topics can be accessed through `most_frequent_topics` with the most occuring topics first in the list."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "db14fe03",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for topic in most_frequent_topics:\n",
|
|
" print(\"Topic:\", topic)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d10f701e",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Topic visualization\n",
|
|
"The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "2331afe6",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"topic_model.visualize_topics()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "f4eaf353",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Save the model\n",
|
|
"The model can be saved for future use."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "e5e8377c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"topic_model.save(\"misinfo_posts\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7c94edb9",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.16"
|
|
},
|
|
"vscode": {
|
|
"interpreter": {
|
|
"hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
|
|
}
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|