зеркало из
https://github.com/ssciwr/AMMICO.git
synced 2025-10-29 21:16:06 +02:00
341 строка
7.9 KiB
Plaintext
Generated
341 строка
7.9 KiB
Plaintext
Generated
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "dcaa3da1",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Notebook for text extraction on image\n",
|
|
"Inga Ulusoy, SSC, July 2022"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "f43f327c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# if running on google colab\n",
|
|
"# flake8-noqa-cell\n",
|
|
"import os\n",
|
|
"\n",
|
|
"if \"google.colab\" in str(get_ipython()):\n",
|
|
" # update python version\n",
|
|
" # install setuptools\n",
|
|
" !pip install setuptools==61 -qqq\n",
|
|
" # install ammico\n",
|
|
" !pip install git+https://github.com/ssciwr/ammico.git -qqq\n",
|
|
" # mount google drive for data and API key\n",
|
|
" from google.colab import drive\n",
|
|
"\n",
|
|
" drive.mount(\"/content/drive\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "cf362e60",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import os\n",
|
|
"import ammico\n",
|
|
"from ammico import utils as mutils\n",
|
|
"from ammico import display as mdisplay"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "27675810",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Here you need to provide the path to your google drive folder\n",
|
|
"# or local folder containing the images\n",
|
|
"images = mutils.find_files(\n",
|
|
" path=\"/content/drive/MyDrive/misinformation-data/\",\n",
|
|
" limit=10,\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "8b32409f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"mydict = mutils.initialize_dict(images)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7b8b929f",
|
|
"metadata": {},
|
|
"source": [
|
|
"# google cloud vision API\n",
|
|
"First 1000 images per month are free."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"os.environ[\n",
|
|
" \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
|
|
"] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0891b795-c7fe-454c-a45d-45fadf788142",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Inspect the elements per image"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7c6ecc88",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"mdisplay.explore_analysis(mydict, identify=\"text-on-image\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Or directly analyze for further processing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for key in mydict:\n",
|
|
" print(key)\n",
|
|
" mydict[key] = ammico.text.TextDetector(\n",
|
|
" mydict[key], analyse_text=True\n",
|
|
" ).analyse_image()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3c063eda",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Convert to dataframe and write csv"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "5709c2cd",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"outdict = mutils.append_data_to_dict(mydict)\n",
|
|
"df = mutils.dump_df(outdict)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "c4f05637",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# check the dataframe\n",
|
|
"df.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "bf6c9ddb",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Write the csv\n",
|
|
"df.to_csv(\"./data_out.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "4bc8ac0a",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Topic analysis\n",
|
|
"The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) using an embedded model through a [spaCy](https://spacy.io/) pipeline."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "4931941b",
|
|
"metadata": {},
|
|
"source": [
|
|
"BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If the below returns an error for `analyse_topic()`, the reason can be that your dataset is too small.\n",
|
|
"### Option 1: Use the dictionary as obtained from the above analysis."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "a3450a61",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
|
|
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
|
|
" mydict=mydict\n",
|
|
").analyse_topic()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "95667342",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Option 2: Read in a csv\n",
|
|
"Not to analyse too many images on google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "5530e436",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"input_file_path = \"data_out.csv\"\n",
|
|
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
|
|
" use_csv=True, csv_path=input_file_path\n",
|
|
").analyse_topic(return_topics=10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0b6ef6d7",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Access frequent topics\n",
|
|
"A topic of `-1` stands for an outlier and should be ignored. Topic count is the number of occurence of that topic. The output is structured from most frequent to least frequent topic."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(topic_df)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b3316770",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Get information for specific topic\n",
|
|
"The most frequent topics can be accessed through `most_frequent_topics` with the most occuring topics first in the list."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "db14fe03",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for topic in most_frequent_topics:\n",
|
|
" print(\"Topic:\", topic)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d10f701e",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Topic visualization\n",
|
|
"The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "2331afe6",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"topic_model.visualize_topics()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "f4eaf353",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Save the model\n",
|
|
"The model can be saved for future use."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "e5e8377c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"topic_model.save(\"misinfo_posts\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7c94edb9",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.5"
|
|
},
|
|
"vscode": {
|
|
"interpreter": {
|
|
"hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
|
|
}
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|