{ "cells": [ { "cell_type": "markdown", "id": "dcaa3da1", "metadata": {}, "source": [ "# Notebook for text extraction from images\n", "Inga Ulusoy, SSC, July 2022" ] }, { "cell_type": "code", "execution_count": null, "id": "f43f327c", "metadata": {}, "outputs": [], "source": [ "# if running on Google Colab\n", "# flake8-noqa-cell\n", "import os\n", "\n", "if \"google.colab\" in str(get_ipython()):\n", "    # update python version\n", "    # install setuptools\n", "    !pip install setuptools==61 -qqq\n", "    # install misinformation\n", "    !pip install git+https://github.com/ssciwr/misinformation.git -qqq\n", "    # mount Google Drive for data and API key\n", "    from google.colab import drive\n", "\n", "    drive.mount(\"/content/drive\")" ] }, { "cell_type": "code", "execution_count": null, "id": "cf362e60", "metadata": {}, "outputs": [], "source": [ "import os\n", "import misinformation\n", "from misinformation import utils as mutils\n", "from misinformation import display as mdisplay" ] }, { "cell_type": "code", "execution_count": null, "id": "27675810", "metadata": {}, "outputs": [], "source": [ "# Here you need to provide the path to your Google Drive folder\n", "# or local folder containing the images\n", "images = mutils.find_files(\n", "    path=\"/content/drive/MyDrive/misinformation-data/\",\n", "    limit=10,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "8b32409f", "metadata": {}, "outputs": [], "source": [ "mydict = mutils.initialize_dict(images)" ] }, { "cell_type": "markdown", "id": "7b8b929f", "metadata": {}, "source": [ "# Google Cloud Vision API\n", "The first 1000 images per month are free." 
] }, { "cell_type": "code", "execution_count": null, "id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986", "metadata": {}, "outputs": [], "source": [ "os.environ[\n", "    \"GOOGLE_APPLICATION_CREDENTIALS\"\n", "] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\"" ] }, { "cell_type": "markdown", "id": "0891b795-c7fe-454c-a45d-45fadf788142", "metadata": {}, "source": [ "## Inspect the elements per image" ] }, { "cell_type": "code", "execution_count": null, "id": "7c6ecc88", "metadata": {}, "outputs": [], "source": [ "mdisplay.explore_analysis(mydict, identify=\"text-on-image\")" ] }, { "cell_type": "markdown", "id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52", "metadata": {}, "source": [ "## Or directly analyze for further processing" ] }, { "cell_type": "code", "execution_count": null, "id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f", "metadata": {}, "outputs": [], "source": [ "for key in mydict:\n", "    print(key)\n", "    mydict[key] = misinformation.text.TextDetector(\n", "        mydict[key], analyse_text=True\n", "    ).analyse_image()" ] }, { "cell_type": "markdown", "id": "3c063eda", "metadata": {}, "source": [ "## Convert to dataframe and write csv" ] }, { "cell_type": "code", "execution_count": null, "id": "5709c2cd", "metadata": {}, "outputs": [], "source": [ "outdict = mutils.append_data_to_dict(mydict)\n", "df = mutils.dump_df(outdict)" ] }, { "cell_type": "code", "execution_count": null, "id": "c4f05637", "metadata": {}, "outputs": [], "source": [ "# check the dataframe\n", "df.head(10)" ] }, { "cell_type": "code", "execution_count": null, "id": "bf6c9ddb", "metadata": {}, "outputs": [], "source": [ "# Write the csv\n", "df.to_csv(\"./data_out.csv\")" ] }, { "cell_type": "markdown", "id": "4bc8ac0a", "metadata": {}, "source": [ "# Topic analysis\n", "The topic analysis is carried out with [BERTopic](https://maartengr.github.io/BERTopic/index.html), using an embedded model through a [spaCy](https://spacy.io/) pipeline." 
] }, { "cell_type": "markdown", "id": "4931941b", "metadata": {}, "source": [ "BERTopic takes a list of strings as input. The more items the list contains, the better the topic modeling works. If the cell below raises an error in `analyse_topic()`, your dataset may be too small.\n", "### Option 1: Use the dictionary obtained from the analysis above." ] }, { "cell_type": "code", "execution_count": null, "id": "a3450a61", "metadata": {}, "outputs": [], "source": [ "# make a list of all the text_english entries per analysed image from the mydict variable as above\n", "topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n", "    mydict=mydict\n", ").analyse_topic()" ] }, { "cell_type": "markdown", "id": "95667342", "metadata": {}, "source": [ "### Option 2: Read in a csv\n", "To avoid sending already analysed images to Google Cloud Vision again, read the text from the csv output instead." ] }, { "cell_type": "code", "execution_count": null, "id": "5530e436", "metadata": {}, "outputs": [], "source": [ "input_file_path = \"data_out.csv\"\n", "topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n", "    use_csv=True, csv_path=input_file_path\n", ").analyse_topic(return_topics=10)" ] }, { "cell_type": "markdown", "id": "0b6ef6d7", "metadata": {}, "source": [ "### Access frequent topics\n", "A topic of `-1` stands for an outlier and should be ignored. Topic count is the number of occurrences of that topic. The output is sorted from the most frequent to the least frequent topic." ] }, { "cell_type": "code", "execution_count": null, "id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f", "metadata": {}, "outputs": [], "source": [ "print(topic_df)" ] }, { "cell_type": "markdown", "id": "b3316770", "metadata": {}, "source": [ "### Get information for a specific topic\n", "The most frequent topics can be accessed through `most_frequent_topics`, with the most frequently occurring topics first in the list." 
] }, { "cell_type": "code", "execution_count": null, "id": "db14fe03", "metadata": {}, "outputs": [], "source": [ "for topic in most_frequent_topics:\n", "    print(\"Topic:\", topic)" ] }, { "cell_type": "markdown", "id": "d10f701e", "metadata": {}, "source": [ "### Topic visualization\n", "The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)." ] }, { "cell_type": "code", "execution_count": null, "id": "2331afe6", "metadata": {}, "outputs": [], "source": [ "topic_model.visualize_topics()" ] }, { "cell_type": "markdown", "id": "f4eaf353", "metadata": {}, "source": [ "### Save the model\n", "The model can be saved for future use." ] }, { "cell_type": "code", "execution_count": null, "id": "e5e8377c", "metadata": {}, "outputs": [], "source": [ "topic_model.save(\"misinfo_posts\")" ] }, { "cell_type": "code", "execution_count": null, "id": "7c94edb9", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" }, "vscode": { "interpreter": { "hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2" } } }, "nbformat": 4, "nbformat_minor": 5 }