diff --git a/README.md b/README.md
index 6013352..3d86c8f 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Misinformation campaign analysis
+# AMMICO - AI Media and Misinformation Content Analysis Tool
 
 ![License: MIT](https://img.shields.io/github/license/ssciwr/misinformation)
 ![GitHub Workflow Status](https://img.shields.io/github/actions/workflow/status/ssciwr/misinformation/ci.yml?branch=main)
@@ -6,44 +6,42 @@
 ![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=ssciwr_misinformation&metric=alert_status)
 ![Language](https://img.shields.io/github/languages/top/ssciwr/misinformation)
 
-Extract data from from social media images and texts in disinformation campaigns.
+This package extracts data from images such as social media images, together with the text that accompanies or is embedded in the image. Depending on the user input, the analysis can extract a large number of features.
 
 **_This project is currently under development!_**
 
-Use the pre-processed social media posts (image files) and process to collect information:
-1. Cropping images to remove comments from posts
+Use pre-processed image files such as social media posts with comments to collect the following information:
 1. Text extraction from the images
-1. Language recognition, translation into English, cleaning of the text/spell-check
-1. Sentiment and subjectivity analysis
-1. Performing person and face recognition in images, emotion recognition
-1. Extraction of other non-human objects in the image
+    1. Language detection
+    1. Translation into English or other languages
+    1. Cleaning of the text, spell-check
+    1. Sentiment analysis
+    1. Subjectivity analysis
+    1. Named entity recognition
+    1. Topic analysis
+1. Content extraction from the images
+    1. Textual summary of the image content ("image caption") that can be analyzed further using the above tools
+    1. Feature extraction from the images: the user inputs a query and images are matched to that query (both text and image queries are supported)
+    1. Question answering
+1. Performing person and face recognition in images
+    1. Face mask detection
+    1. Age, gender and race detection
+    1. Emotion recognition
+1. Object detection in images
+    1. Detection of the position and number of objects in the image; currently: person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, cell phone
+1. Cropping images to remove comments from posts
 
-This development will serve the fight to combat misinformation, by providing more comprehensive data about its content and techniques.
-The ultimate goal of this project is to develop a computer-assisted toolset to investigate the content of disinformation campaigns worldwide.
 
-# Installation
+## Installation
 
-The `misinformation` package can be installed using pip: Navigate into your package folder `misinformation/` and execute
+The `AMMICO` package can be installed using pip: navigate into your package folder `misinformation/` and execute
 ```
 pip install .
 ```
 This will install the package and its dependencies locally.
 
-## Installation on Windows
-Some modules use [lavis]() to anaylse image content. To enable this functionality on Windows OS, you need to install some dependencies that are not available by default or can be obtained from the command line:
-1. Download [Visual C++](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170) and install (see also [here](https://github.com/philferriere/cocoapi)).
-1. Then install the coco API from Github
-```
-pip install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"
-```
-1. Now you can install the package by navigating to the misinformation directory and typing
-```
-pip install .
-```
-in the command prompt.
-
-# Usage
+## Usage
 
 There are sample notebooks in the `misinformation/notebooks` folder for you to explore the package:
 1. Text analysis: Use the notebook `get-text-from-image.ipynb` to extract any text from the images. The text is directly translated into English. If the text should be further analysed, set the keyword `analyse_text` to `True` as demonstrated in the notebook.\
@@ -56,8 +54,8 @@ Place the data files in your google drive to access the data.**
 There are further notebooks that are currently of exploratory nature (`colors_expression.ipynb` to identify certain colors on the image).
 
-# Features
-## Text extraction
+## Features
+### Text extraction
 The text is extracted from the images using [`google-cloud-vision`](https://cloud.google.com/vision). For this, you need an API key. Set up your google account following the instructions on the google Vision AI website. You then need to export the location of the API key as an environment variable:
 `export GOOGLE_APPLICATION_CREDENTIALS="location of your .json"`
@@ -67,8 +65,8 @@ The extracted text is then stored under the `text` key (column when exporting a
 If you further want to analyse the text, you have to set the `analyse_text` keyword to `True`. In doing so, the text is then processed using [spacy](https://spacy.io/) (tokenized, part-of-speech, lemma, ...). The English text is cleaned from numbers and unrecognized words (`text_clean`), spelling of the English text is corrected (`text_english_correct`), and further sentiment and subjectivity analysis are carried out (`polarity`, `subjectivity`). The latter two steps are carried out using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html). For more information on the sentiment analysis using TextBlob see [here](https://towardsdatascience.com/my-absolute-go-to-for-sentiment-analysis-textblob-3ac3a11d524).
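+
+A minimal end-to-end sketch of this workflow, following the `Example text.ipynb` notebook (the credentials file and image folder are placeholders):
+```
+import os
+import misinformation
+from misinformation import utils as mutils
+
+# point to your Google Cloud Vision API key (placeholder path)
+os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "your-credentials.json"
+# find images and set up the dictionary that collects all results
+images = mutils.find_files(path="data/", limit=10)
+mydict = mutils.initialize_dict(images)
+# extract and analyse the text on each image
+for key in mydict:
+    mydict[key] = misinformation.text.TextDetector(
+        mydict[key], analyse_text=True
+    ).analyse_image()
+# export the results to a csv file
+df = mutils.dump_df(mutils.append_data_to_dict(mydict))
+df.to_csv("data_out.csv")
+```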
 
-## Emotion recognition
+### Emotion recognition
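+
+Emotions are detected on the faces found in the image. A minimal sketch, following the `Example faces.ipynb` notebook (the image folder is a placeholder):
+```
+import misinformation
+from misinformation import utils as mutils
+
+images = mutils.find_files(path="data/", limit=10)
+mydict = mutils.initialize_dict(images)
+for key in mydict:
+    mydict[key] = misinformation.faces.EmotionDetector(mydict[key]).analyse_image()
+```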
 
-## Object detection
+### Object detection
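+
+Objects are detected using cvlib. A minimal sketch, following the `Example objects.ipynb` notebook (the image folder is a placeholder):
+```
+import misinformation.objects as ob
+from misinformation import utils as mutils
+
+images = mutils.find_files(path="data/", limit=10)
+mydict = mutils.initialize_dict(images)
+for key in mydict:
+    mydict[key] = ob.ObjectDetector(mydict[key]).analyse_image()
+```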
 
-## Cropping of posts
\ No newline at end of file
+### Cropping of posts
diff --git a/docs/source/conf.py b/docs/source/conf.py
index d4dd3e2..af54fdc 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -20,7 +20,8 @@ release = "0.0.1"
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
 
-extensions = ["sphinx.ext.autodoc", "sphinx.ext.napoleon", "myst_parser"]
+extensions = ["sphinx.ext.autodoc", "sphinx.ext.napoleon", "myst_parser", "nbsphinx"]
+nbsphinx_allow_errors = True
 
 templates_path = ["_templates"]
 exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
diff --git a/docs/source/index.rst b/docs/source/index.rst
index fd09508..074022f 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -11,6 +11,7 @@ Welcome to misinformation's documentation!
    :caption: Contents:
 
    readme_link
+   notebooks/Example faces
    faces
    modules
    license_link
diff --git a/docs/source/notebooks/3_blip_saved_features_image.pt b/docs/source/notebooks/3_blip_saved_features_image.pt
new file mode 100644
index 0000000..5b259e5
Binary files /dev/null and b/docs/source/notebooks/3_blip_saved_features_image.pt differ
diff --git a/docs/source/notebooks/Example faces.ipynb b/docs/source/notebooks/Example faces.ipynb
new file mode 100644
index 0000000..7da9ac1
--- /dev/null
+++ b/docs/source/notebooks/Example faces.ipynb
@@ -0,0 +1,220 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d2c4d40d-8aca-4024-8d19-a65c4efe825d",
+   "metadata": {},
+   "source": [
+    "# Facial Expression recognition with DeepFace"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "51f8888b-d1a3-4b85-a596-95c0993fa192",
+   "metadata": {},
+   "source": [
+    "This notebook shows some preliminary work on detecting facial expressions with DeepFace. It is mainly meant to explore its capabilities and to decide on future research directions. We package our code into a `misinformation` package that is imported here:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b21e52a5-d379-42db-aae6-f2ab9ed9a369",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import misinformation\n",
+    "from misinformation import utils as mutils\n",
+    "from misinformation import display as mdisplay"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a2bd2153",
+   "metadata": {},
+   "source": [
+    "We select a subset of image files to try facial expression detection on. The `find_files` function finds image files within a given directory:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "afe7e638-f09d-47e7-9295-1c374bd64c53",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "images = mutils.find_files(\n",
+    "    path=\"data/\",\n",
+    "    limit=10,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e149bfe5-90b0-49b2-af3d-688e41aab019",
+   "metadata": {},
+   "source": [
+    "If you want to fine-tune the discovery of image files, you can provide more parameters:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f38bb8ed-1004-4e33-8ed6-793cb5869400",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "?mutils.find_files"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "705e7328",
+   "metadata": {},
+   "source": [
+    "We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b37c0c91",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mydict = mutils.initialize_dict(images)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a9372561",
+   "metadata": {},
+   "source": [
+    "To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below.\n",
+    "Here, we display the face recognition results provided by the DeepFace library. Click on the tabs to see the results in the right sidebar:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "992499ed-33f1-4425-ad5d-738cf565d175",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mdisplay.explore_analysis(mydict, identify=\"faces\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f974341",
+   "metadata": {},
+   "source": [
+    "Directly carry out the analysis and export the result into a csv file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6f97c7d0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for key in mydict.keys():\n",
+    "    mydict[key] = misinformation.faces.EmotionDetector(mydict[key]).analyse_image()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "174357b1",
+   "metadata": {},
+   "source": [
+    "Convert the dictionary of dictionaries into a dictionary with lists:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "604bd257",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "outdict = mutils.append_data_to_dict(mydict)\n",
+    "df = mutils.dump_df(outdict)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8373d9f8",
+   "metadata": {},
+   "source": [
+    "Check the dataframe:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aa4b518a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "579cd59f",
+   "metadata": {},
+   "source": [
+    "Write the csv file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4618decb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.to_csv(\"data/data_out.csv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b1a80023",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/source/notebooks/Example multimodal.ipynb b/docs/source/notebooks/Example multimodal.ipynb
new file mode 100644
index 0000000..c091b84
--- /dev/null
+++ b/docs/source/notebooks/Example multimodal.ipynb
@@ -0,0 +1,341 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "22df2297-0629-45aa-b88c-6c61f1544db6",
+   "metadata": {},
+   "source": [
+    "# Image Multimodal Search"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9eeeb302-296e-48dc-86c7-254aa02f2b3a",
+   "metadata": {},
+   "source": [
+    "This notebook shows some preliminary work on image multimodal search with the lavis library. It is mainly meant to explore its capabilities and to decide on future research directions. We package our code into a `misinformation` package that is imported here:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f10ad6c9-b1a0-4043-8c5d-ed660d77be37",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import misinformation\n",
+    "import misinformation.multimodal_search as ms"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "acf08b44-3ea6-44cd-926d-15c0fd9f39e0",
+   "metadata": {},
+   "source": [
+    "Set the image folder as the input path:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8d3fe589-ff3c-4575-b8f5-650db85596bc",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "images = misinformation.utils.find_files(\n",
+    "    path=\"data/\",\n",
+    "    limit=10,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "adf3db21-1f8b-4d44-bbef-ef0acf4623a0",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "mydict = misinformation.utils.initialize_dict(images)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "987540a8-d800-4c70-a76b-7bfabaf123fa",
+   "metadata": {},
+   "source": [
+    "## Indexing and extracting features from images in the selected folder"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "66d6ede4-00bc-4aeb-9a36-e52d7de33fe5",
+   "metadata": {},
+   "source": [
+    "You can choose one of the following models: blip, blip2, albef, clip_base, clip_vitl14, clip_vitl14_336"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7bbca1f0-d4b0-43cd-8e05-ee39d37c328e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "model_type = \"blip\"\n",
+    "# model_type = \"blip2\"\n",
+    "# model_type = \"albef\"\n",
+    "# model_type = \"clip_base\"\n",
+    "# model_type = \"clip_vitl14\"\n",
+    "# model_type = \"clip_vitl14_336\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ca095404-57d0-4f5d-aeb0-38c232252b17",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "(\n",
+    "    model,\n",
+    "    vis_processors,\n",
+    "    txt_processors,\n",
+    "    image_keys,\n",
+    "    image_names,\n",
+    "    features_image_stacked,\n",
+    ") = ms.MultimodalSearch.parsing_images(mydict, model_type)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9ff8a894-566b-4c4f-acca-21c50b5b1f52",
+   "metadata": {},
+   "source": [
+    "The tensors of all images, `features_image_stacked`, are saved in `__saved_features_image.pt`. If you have run this once for the current model and the current set of images, you do not need to repeat it. Instead, you can load the features with the command:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "56c6d488-f093-4661-835a-5c73a329c874",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# (\n",
+    "#     model,\n",
+    "#     vis_processors,\n",
+    "#     txt_processors,\n",
+    "#     image_keys,\n",
+    "#     image_names,\n",
+    "#     features_image_stacked,\n",
+    "# ) = ms.MultimodalSearch.parsing_images(mydict, model_type,\"18_clip_base_saved_features_image.pt\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "309923c1-d6f8-4424-8fca-bde5f3a98b38",
+   "metadata": {},
+   "source": [
+    "Here we have already processed our image folder of 18 images with the `clip_base` model, so you just need to pass the name of the saved file containing the image tensors, `18_clip_base_saved_features_image.pt`, as the third argument to the previous function."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "162a52e8-6652-4897-b92e-645cab07aaef",
+   "metadata": {},
+   "source": [
+    "Next, you need to form search queries. You can search either by image or by text. You can search for a single query, or for several queries at once; the computational time should not differ much. The format of the queries is as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c4196a52-d01e-42e4-8674-5712f7d6f792",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "search_query3 = [\n",
+    "    {\"text_input\": \"politician press conference\"},\n",
+    "    {\"text_input\": \"a person wearing a mask\"},\n",
+    "    {\"image\": \"data/106349S_por.png\"},\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8bcf3127-3dfd-4ff4-b9e7-a043099b1418",
+   "metadata": {},
+   "source": [
+    "You can filter your results in 3 different ways:\n",
+    "- `filter_number_of_images` limits the number of images found. That is, if `filter_number_of_images = 10`, the 10 images that best match the query will be shown. The ranks of the other images will be set to `None` and their similarity values to `0`.\n",
+    "- `filter_val_limit` sets a lower bound on the similarity value. That is, if `filter_val_limit = 0.2`, all images with a similarity of less than 0.2 will be discarded.\n",
+    "- `filter_rel_error` (a percentage) discards all images for which `100 * abs(current_similarity_value - best_similarity_value_in_current_search) / best_similarity_value_in_current_search` exceeds `filter_rel_error`. That is, if we set `filter_rel_error = 30` and the top-1 image has a similarity value of 0.5, we discard all images with a similarity of less than 0.35."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7f7dc52f-7ee9-4590-96b7-e0d9d3b82378",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "similarity = ms.MultimodalSearch.multimodal_search(\n",
+    "    mydict,\n",
+    "    model,\n",
+    "    vis_processors,\n",
+    "    txt_processors,\n",
+    "    model_type,\n",
+    "    image_keys,\n",
+    "    features_image_stacked,\n",
+    "    search_query3,\n",
+    ")"
+   ]
+  },
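+  {
+   "cell_type": "markdown",
+   "id": "a1f5c2d0-filter-example-sketch",
+   "metadata": {},
+   "source": [
+    "For example, to keep only the ten best matches and discard images with a similarity below 0.2, the filters described above can be passed along with the query. This is an illustrative sketch that assumes `multimodal_search` accepts the filters as keyword arguments; uncomment to try it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b2e6d3f1-filter-example-sketch",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# illustrative only: filter keyword arguments as described in the markdown above\n",
+    "# similarity = ms.MultimodalSearch.multimodal_search(\n",
+    "#     mydict,\n",
+    "#     model,\n",
+    "#     vis_processors,\n",
+    "#     txt_processors,\n",
+    "#     model_type,\n",
+    "#     image_keys,\n",
+    "#     features_image_stacked,\n",
+    "#     search_query3,\n",
+    "#     filter_number_of_images=10,\n",
+    "#     filter_val_limit=0.2,\n",
+    "#     filter_rel_error=30,\n",
+    "# )"
+   ]
+  },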
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4324e4fd-e9aa-4933-bb12-074d54e0c510", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ms.MultimodalSearch.show_results(mydict, search_query3[0])" + ] + }, + { + "cell_type": "markdown", + "id": "d86ab96b-1907-4b7f-a78e-3983b516d781", + "metadata": { + "tags": [] + }, + "source": [ + "## Save search results to csv" + ] + }, + { + "cell_type": "markdown", + "id": "4bdbc4d4-695d-4751-ab7c-d2d98e2917d7", + "metadata": { + "tags": [] + }, + "source": [ + "Convert the dictionary of dictionarys into a dictionary with lists:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c6ddd83-bc87-48f2-a8d6-1bd3f4201ff7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "outdict = misinformation.utils.append_data_to_dict(mydict)\n", + "df = misinformation.utils.dump_df(outdict)" + ] + }, + { + "cell_type": "markdown", + "id": "ea2675d5-604c-45e7-86d2-080b1f4559a0", + "metadata": { + "tags": [] + }, + "source": [ + "Check the dataframe:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e78646d6-80be-4d3e-8123-3360957bcaa8", + "metadata": {}, + "outputs": [], + "source": [ + "df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "05546d99-afab-4565-8f30-f14e1426abcf", + "metadata": {}, + "source": [ + "Write the csv file:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "185f7dde-20dc-44d8-9ab0-de41f9b5734d", + "metadata": {}, + "outputs": [], + "source": [ + "df.to_csv(\"./data_out.csv\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/source/notebooks/Example objects.ipynb b/docs/source/notebooks/Example objects.ipynb new file mode 100644 index 0000000..567ba77 --- /dev/null +++ b/docs/source/notebooks/Example objects.ipynb @@ -0,0 +1,174 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Objects Expression recognition" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebooks shows some preliminary work on detecting objects expressions with cvlib. It is mainly meant to explore its capabilities and to decide on future research directions. We package our code into a `misinformation` package that is imported here:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import misinformation\n", + "from misinformation import utils as mutils\n", + "from misinformation import display as mdisplay\n", + "import misinformation.objects as ob" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Set an image path as input file path." 
diff --git a/docs/source/notebooks/Example objects.ipynb b/docs/source/notebooks/Example objects.ipynb
new file mode 100644
index 0000000..567ba77
--- /dev/null
+++ b/docs/source/notebooks/Example objects.ipynb
@@ -0,0 +1,174 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Object detection with cvlib"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook shows some preliminary work on detecting objects with cvlib. It is mainly meant to explore its capabilities and to decide on future research directions. We package our code into a `misinformation` package that is imported here:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import misinformation\n",
+    "from misinformation import utils as mutils\n",
+    "from misinformation import display as mdisplay\n",
+    "import misinformation.objects as ob"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Set the image folder as the input path:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "images = mutils.find_files(\n",
+    "    path=\"data/\",\n",
+    "    limit=10,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "mydict = mutils.initialize_dict(images)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Manually inspect what was detected\n",
+    "\n",
+    "To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mdisplay.explore_analysis(mydict, identify=\"objects\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Detect objects and directly write to csv"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for key in mydict:\n",
+    "    mydict[key] = ob.ObjectDetector(mydict[key]).analyse_image()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Convert the dictionary of dictionaries into a dictionary with lists:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "outdict = mutils.append_data_to_dict(mydict)\n",
+    "df = mutils.dump_df(outdict)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Check the dataframe:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Write the csv file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.to_csv(\"./data_out.csv\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "f1142466f556ab37fe2d38e2897a16796906208adb09fea90ba58bdf8a56f0ba"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/docs/source/notebooks/Example summary.ipynb b/docs/source/notebooks/Example summary.ipynb
new file mode 100644
index 0000000..1177a72
--- /dev/null
+++ b/docs/source/notebooks/Example summary.ipynb
@@ -0,0 +1,292 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Image summary and visual question answering"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook shows some preliminary work on image captioning and visual question answering with lavis. It is mainly meant to explore its capabilities and to decide on future research directions. We package our code into a `misinformation` package that is imported here:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import misinformation\n",
+    "from misinformation import utils as mutils\n",
+    "from misinformation import display as mdisplay\n",
+    "import misinformation.summary as sm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Set the image folder as the input path:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "images = mutils.find_files(\n",
+    "    path=\"../misinformation/test/data/\",\n",
+    "    limit=1000,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mydict = mutils.initialize_dict(images[0:10])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mydict"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create captions for images and directly write to csv"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here you can choose between two models: \"base\" or \"large\":"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "summary_model, summary_vis_processors = mutils.load_model(\"base\")\n",
+    "# summary_model, summary_vis_processors = mutils.load_model(\"large\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for key in mydict:\n",
+    "    mydict[key] = sm.SummaryDetector(mydict[key]).analyse_image(\n",
+    "        summary_model, summary_vis_processors\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "Convert the dictionary of dictionaries into a dictionary with lists:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "outdict = mutils.append_data_to_dict(mydict)\n",
+    "df = mutils.dump_df(outdict)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Check the dataframe:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Write the csv file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.to_csv(\"./data_out.csv\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Manually inspect the summaries\n",
+    "\n",
+    "To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step above.\n",
+    "\n",
+    "`const_image_summary` - the deterministic summary, which does not change from run to run (`analyse_image`).\n",
+    "\n",
+    "`3_non-deterministic summary` - three different summary examples that change from run to run (`analyse_image`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mdisplay.explore_analysis(mydict, identify=\"summary\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Generate answers to free-form questions about images, written in natural language"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Set the list of questions:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "list_of_questions = [\n",
+    "    \"How many persons on the picture?\",\n",
+    "    \"Are there any politicians in the picture?\",\n",
+    "    \"Does the picture show something from medicine?\",\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for key in mydict:\n",
+    "    mydict[key] = sm.SummaryDetector(mydict[key]).analyse_questions(list_of_questions)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mdisplay.explore_analysis(mydict, identify=\"summary\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Convert the dictionary of dictionaries into a dictionary with lists:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "outdict2 = mutils.append_data_to_dict(mydict)\n",
+    "df2 = mutils.dump_df(outdict2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df2.head(10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df2.to_csv(\"./data_out2.csv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "f1142466f556ab37fe2d38e2897a16796906208adb09fea90ba58bdf8a56f0ba"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Set the list of questions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "list_of_questions = [\n", + " \"How many persons on the picture?\",\n", + " \"Are there any politicians in the picture?\",\n", + " \"Does the picture show something from medicine?\",\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for key in mydict:\n", + " mydict[key] = sm.SummaryDetector(mydict[key]).analyse_questions(list_of_questions)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mdisplay.explore_analysis(mydict, identify=\"summary\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Convert the dictionary of dictionarys into a dictionary with lists:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "outdict2 = mutils.append_data_to_dict(mydict)\n", + "df2 = mutils.dump_df(outdict2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df2.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df2.to_csv(\"./data_out2.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + }, + "vscode": { + "interpreter": { + "hash": "f1142466f556ab37fe2d38e2897a16796906208adb09fea90ba58bdf8a56f0ba" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/source/notebooks/Example text.ipynb b/docs/source/notebooks/Example text.ipynb new file mode 100644 index 0000000..0542220 --- /dev/null +++ b/docs/source/notebooks/Example text.ipynb @@ -0,0 +1,362 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "dcaa3da1", + "metadata": {}, + "source": [ + "# Notebook for text extraction on image\n", + "Inga Ulusoy, SSC, July 2022" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f43f327c", + "metadata": {}, + "outputs": [], + "source": [ + "# if running on google colab\n", + "# flake8-noqa-cell\n", + "import os\n", + "\n", + "if \"google.colab\" in str(get_ipython()):\n", + " # update python version\n", + " # install setuptools\n", + " !pip install setuptools==61 -qqq\n", + " # install misinformation\n", + " !pip install git+https://github.com/ssciwr/misinformation.git -qqq\n", + " # mount google drive for data and API key\n", + " from google.colab import drive\n", + "\n", + " drive.mount(\"/content/drive\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cf362e60", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from IPython.display import Image, display\n", + "import misinformation\n", + "from misinformation import utils as mutils\n", + "from misinformation import display as mdisplay\n", + "import tensorflow as tf\n", + "\n", + "print(tf.config.list_physical_devices(\"GPU\"))" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "id": "27675810", + "metadata": {}, + "outputs": [], + "source": [ + "# download the models if they are not there yet\n", + "!python -m spacy download en_core_web_md\n", + "!python -m textblob.download_corpora" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6da3a7aa", + "metadata": {}, + "outputs": [], + "source": [ + "images = mutils.find_files(path=\"../data/all/\", limit=1000)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf811ce0", + "metadata": {}, + "outputs": [], + "source": [ + "for i in images[0:3]:\n", + " display(Image(filename=i))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b32409f", + "metadata": {}, + "outputs": [], + "source": [ + "mydict = mutils.initialize_dict(images[0:3])" + ] + }, + { + "cell_type": "markdown", + "id": "7b8b929f", + "metadata": {}, + "source": [ + "# google cloud vision API\n", + "First 1000 images per month are free." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986", + "metadata": {}, + "outputs": [], + "source": [ + "os.environ[\n", + " \"GOOGLE_APPLICATION_CREDENTIALS\"\n", + "] = \"../data/misinformation-campaign-981aa55a3b13.json\"" + ] + }, + { + "cell_type": "markdown", + "id": "0891b795-c7fe-454c-a45d-45fadf788142", + "metadata": {}, + "source": [ + "## Inspect the elements per image" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7c6ecc88", + "metadata": {}, + "outputs": [], + "source": [ + "mdisplay.explore_analysis(mydict, identify=\"text-on-image\")" + ] + }, + { + "cell_type": "markdown", + "id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52", + "metadata": {}, + "source": [ + "## Or directly analyze for further processing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f", + "metadata": {}, + "outputs": [], + "source": [ + "for key in mydict:\n", + " print(key)\n", + " mydict[key] = misinformation.text.TextDetector(\n", + " mydict[key], analyse_text=True\n", + " ).analyse_image()" + ] + }, + { + "cell_type": "markdown", + "id": "3c063eda", + "metadata": {}, + "source": [ + "## Convert to dataframe and write csv" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5709c2cd", + "metadata": {}, + "outputs": [], + "source": [ + "outdict = mutils.append_data_to_dict(mydict)\n", + "df = mutils.dump_df(outdict)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c4f05637", + "metadata": {}, + "outputs": [], + "source": [ + "# check the dataframe\n", + "df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf6c9ddb", + "metadata": {}, + "outputs": [], + "source": [ + "# Write the csv\n", + "df.to_csv(\"./data_out.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "4bc8ac0a", + "metadata": {}, + "source": [ + "# Topic analysis\n", + "The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) using an embedded model through a [spaCy](https://spacy.io/) pipeline." + ] + }, + { + "cell_type": "markdown", + "id": "4931941b", + "metadata": {}, + "source": [ + "BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If the below returns an error for `analyse_topic()`, the reason can be that your dataset is too small.\n", + "### Option 1: Use the dictionary as obtained from the above analysis." 
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a3450a61",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
+    "topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n",
+    "    mydict=mydict\n",
+    ").analyse_topic()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "95667342",
+   "metadata": {},
+   "source": [
+    "### Option 2: Read in a csv\n",
+    "To avoid analysing too many images on Google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5530e436",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "input_file_path = \"data_out.csv\"\n",
+    "topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n",
+    "    use_csv=True, csv_path=input_file_path\n",
+    ").analyse_topic(return_topics=10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0b6ef6d7",
+   "metadata": {},
+   "source": [
+    "### Access frequent topics\n",
+    "A topic of `-1` stands for an outlier and should be ignored. The topic count is the number of occurrences of that topic. The output is ordered from most frequent to least frequent topic."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(topic_df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b3316770",
+   "metadata": {},
+   "source": [
+    "### Get information for a specific topic\n",
+    "The most frequent topics can be accessed through `most_frequent_topics`, with the most frequently occurring topics first in the list."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "db14fe03",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for topic in most_frequent_topics:\n",
+    "    print(\"Topic:\", topic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d10f701e",
+   "metadata": {},
+   "source": [
+    "### Topic visualization\n",
+    "The topics can also be visualized. Careful: this only works if there is sufficient data (quantity and quality)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2331afe6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "topic_model.visualize_topics()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f4eaf353",
+   "metadata": {},
+   "source": [
+    "### Save the model\n",
+    "The model can be saved for future use."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e5e8377c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "topic_model.save(\"misinfo_posts\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7c94edb9",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.5"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/source/notebooks/data/102141_2_eng.png b/docs/source/notebooks/data/102141_2_eng.png
new file mode 100644
index 0000000..4693c0e
Binary files /dev/null and b/docs/source/notebooks/data/102141_2_eng.png differ
diff --git a/docs/source/notebooks/data/102730_eng.png b/docs/source/notebooks/data/102730_eng.png
new file mode 100644
index 0000000..7ee3130
Binary files /dev/null and b/docs/source/notebooks/data/102730_eng.png differ
diff --git a/docs/source/notebooks/data/106349S_por.png b/docs/source/notebooks/data/106349S_por.png
new file mode 100644
index 0000000..a6cdcb2
Binary files /dev/null and b/docs/source/notebooks/data/106349S_por.png differ
diff --git a/misinformation/test/data/IMG_3758.png b/misinformation/test/data/IMG_3758.png
index bf385ee..62d349c 100644
Binary files a/misinformation/test/data/IMG_3758.png and b/misinformation/test/data/IMG_3758.png differ
diff --git a/misinformation/utils.py b/misinformation/utils.py
index 36c7690..1862b1d 100644
--- a/misinformation/utils.py
+++ b/misinformation/utils.py
@@ -2,6 +2,8 @@ import glob
 import os
 from pandas import DataFrame
 import pooch
+from torch import device, cuda
+from lavis.models import load_model_and_preprocess
 
 
 class DownloadResource:
@@ -106,3 +108,34 @@ if __name__ == "__main__":
     outdict = append_data_to_dict(mydict)
     df = dump_df(outdict)
     print(df.head(10))
+
+
+def load_model_base():
+    summary_device = device("cuda" if cuda.is_available() else "cpu")
+    summary_model, summary_vis_processors, _ = load_model_and_preprocess(
+        name="blip_caption",
+        model_type="base_coco",
+        is_eval=True,
+        device=summary_device,
+    )
+    return summary_model, summary_vis_processors
+
+
+def load_model_large():
+    summary_device = device("cuda" if cuda.is_available() else "cpu")
+    summary_model, summary_vis_processors, _ = load_model_and_preprocess(
+        name="blip_caption",
+        model_type="large_coco",
+        is_eval=True,
+        device=summary_device,
+    )
+    return summary_model, summary_vis_processors
+
+
+def load_model(model_type):
+    select_model = {
+        "base": load_model_base,
+        "large": load_model_large,
+    }
+    summary_model, summary_vis_processors = select_model[model_type]()
+    return summary_model, summary_vis_processors
diff --git a/pyproject.toml b/pyproject.toml
index a276151..7c68bea 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -30,7 +30,6 @@ dependencies = [
     "importlib_metadata",
     "ipython",
     "ipywidgets",
-    "jupyterlab",
     "matplotlib",
     "numpy<=1.23.4",
     "pandas",
diff --git a/requirements-dev.txt b/requirements-dev.txt
index 361d4f4..27bf7d7 100644
--- a/requirements-dev.txt
+++ b/requirements-dev.txt
@@ -1,4 +1,5 @@
 sphinx
 myst-parser
 sphinx_rtd_theme
-sphinxcontrib-napoleon
\ No newline at end of file
+sphinxcontrib-napoleon
+nbsphinx
\ No newline at end of file