зеркало из
https://github.com/ssciwr/AMMICO.git
synced 2025-10-29 05:04:14 +02:00
* [pre-commit.ci] pre-commit autoupdate updates: - [github.com/kynan/nbstripout: 0.6.1 → 0.7.1](https://github.com/kynan/nbstripout/compare/0.6.1...0.7.1) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
372 строки
11 KiB
Plaintext
372 строки
11 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Notebook for text extraction on image\n",
|
|
"\n",
|
|
"The text extraction and analysis is carried out using a variety of tools: \n",
|
|
"\n",
|
|
"1. Text extraction from the image using [google-cloud-vision](https://cloud.google.com/vision) \n",
|
|
"1. Language detection of the extracted text using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
|
|
"1. Translation into English or other languages using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n",
|
|
"1. Cleaning of the text using [spacy](https://spacy.io/) \n",
|
|
"1. Spell-check using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
|
|
"1. Subjectivity analysis using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n",
|
|
"1. Text summarization using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n",
|
|
"1. Sentiment analysis using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
|
|
"1. Named entity recognition using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n",
|
|
"1. Topic analysis using [BERTopic](https://github.com/MaartenGr/BERTopic) \n",
|
|
"\n",
|
|
"The first cell is only run on google colab and installs the [ammico](https://github.com/ssciwr/AMMICO) package.\n",
|
|
"\n",
|
|
"After that, we can import `ammico` and read in the files given a folder path."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "1",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# if running on google colab\n",
|
|
"# flake8-noqa-cell\n",
|
|
"import os\n",
|
|
"\n",
|
|
"if \"google.colab\" in str(get_ipython()):\n",
|
|
" # update python version\n",
|
|
" # install setuptools\n",
|
|
" # %pip install setuptools==61 -qqq\n",
|
|
" # uninstall some pre-installed packages due to incompatibility\n",
|
|
" %pip uninstall tensorflow-probability dopamine-rl lida pandas-gbq torchaudio torchdata torchtext orbax-checkpoint -y -qqq\n",
|
|
" # install ammico\n",
|
|
" %pip install git+https://github.com/ssciwr/ammico.git -qqq\n",
|
|
" # mount google drive for data and API key\n",
|
|
" from google.colab import drive\n",
|
|
"\n",
|
|
" drive.mount(\"/content/drive\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "2",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import os\n",
|
|
"import ammico"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3",
|
|
"metadata": {},
|
|
"source": [
|
|
"We select a subset of image files to try the text extraction on, see the `limit` keyword. The `find_files` function finds image files within a given directory and initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis: "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "4",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Here you need to provide the path to your google drive folder\n",
|
|
"# or local folder containing the images\n",
|
|
"mydict = ammico.find_files(\n",
|
|
" path=\"/content/drive/MyDrive/misinformation-data/\",\n",
|
|
" limit=10,\n",
|
|
")\n",
|
|
"mydict"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Google cloud vision API\n",
|
|
"\n",
|
|
"For this you need an API key and have the app activated in your google console. The first 1000 images per month are free (July 2022)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"os.environ[\n",
|
|
" \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
|
|
"] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Inspect the elements per image\n",
|
|
"To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below.\n",
|
|
"Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the `port` number if you are already running several notebook instances on the same server."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "8",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"analysis_explorer = ammico.AnalysisExplorer(mydict)\n",
|
|
"analysis_explorer.run_server(port=8054)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Or directly analyze for further processing\n",
|
|
"Instead of inspecting each of the images, you can also directly carry out the analysis and export the result into a csv. This may take a while depending on how many images you have loaded. Set the keyword `analyse_text` to `True` if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "10",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for key in mydict:\n",
|
|
" mydict[key] = ammico.TextDetector(\n",
|
|
" mydict[key], analyse_text=True\n",
|
|
" ).analyse_image()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "11",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Convert to dataframe and write csv\n",
|
|
"These steps are required to convert the dictionary of dictionarys into a dictionary with lists, that can be converted into a pandas dataframe and exported to a csv file."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "12",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = ammico.get_dataframe(mydict)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "13",
|
|
"metadata": {},
|
|
"source": [
|
|
"Check the dataframe:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "14",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "15",
|
|
"metadata": {},
|
|
"source": [
|
|
"Write the csv file - here you should provide a file path and file name for the csv file to be written."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "16",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Write the csv\n",
|
|
"df.to_csv(\"/content/drive/MyDrive/misinformation-data/data_out.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "17",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Topic analysis\n",
|
|
"The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) using an embedded model through a [spaCy](https://spacy.io/) pipeline."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "18",
|
|
"metadata": {},
|
|
"source": [
|
|
"BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If the below returns an error for `analyse_topic()`, the reason can be that your dataset is too small.\n",
|
|
"\n",
|
|
"You can pass which dataframe entry you would like to have analyzed. The default is `text_english`, but you could for example also select `text_summary` or `text_english_correct` setting the keyword `analyze_text` as so:\n",
|
|
"\n",
|
|
"`ammico.text.PostprocessText(mydict=mydict, analyze_text=\"text_summary\").analyse_topic()`\n",
|
|
"\n",
|
|
"### Option 1: Use the dictionary as obtained from the above analysis."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "19",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
|
|
"topic_model, topic_df, most_frequent_topics = ammico.PostprocessText(\n",
|
|
" mydict=mydict\n",
|
|
").analyse_topic()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "20",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Option 2: Read in a csv\n",
|
|
"Not to analyse too many images on google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "21",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"input_file_path = \"/content/drive/MyDrive/misinformation-data/data_out.csv\"\n",
|
|
"topic_model, topic_df, most_frequent_topics = ammico.PostprocessText(\n",
|
|
" use_csv=True, csv_path=input_file_path\n",
|
|
").analyse_topic(return_topics=10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "22",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Access frequent topics\n",
|
|
"A topic of `-1` stands for an outlier and should be ignored. Topic count is the number of occurence of that topic. The output is structured from most frequent to least frequent topic."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "23",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(topic_df)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "24",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Get information for specific topic\n",
|
|
"The most frequent topics can be accessed through `most_frequent_topics` with the most occuring topics first in the list."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "25",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for topic in most_frequent_topics:\n",
|
|
" print(\"Topic:\", topic)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "26",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Topic visualization\n",
|
|
"The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "27",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"topic_model.visualize_topics()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "28",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Save the model\n",
|
|
"The model can be saved for future use."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "29",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"topic_model.save(\"misinfo_posts\")"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.16"
|
|
},
|
|
"vscode": {
|
|
"interpreter": {
|
|
"hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
|
|
}
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|