Mirror of https://github.com/ssciwr/AMMICO.git
synced 2025-10-29 13:06:04 +02:00
add text classification transformers (#68)
* add text classification transformers
* add ner
* use specified model for tasks; allow summary in BERT
* update notebooks and dockerfile
* links for notebooks on colab
* update notebooks image path for colab
This commit is contained in:
parent 14a1a03597
commit 3b1c3ef1ed
Dockerfile
@@ -11,9 +11,6 @@ COPY --chown=${NB_UID} . /opt/misinformation

 # Install the Python package
 RUN python -m pip install /opt/misinformation

-# Install additional dependencies for running the notebooks
-RUN python -m pip install -r /opt/misinformation/requirements.txt
-
 # Make JupyterLab the default for this application
 ENV JUPYTER_ENABLE_LAB=yes
13 README.md
@@ -44,15 +44,20 @@ This will install the package and its dependencies locally.

 ## Usage

 There are sample notebooks in the `misinformation/notebooks` folder for you to explore the package:
-1. Text analysis: Use the notebook `get-text-from-image.ipynb` to extract any text from the images. The text is directly translated into English. If the text should be further analysed, set the keyword `analyse_text` to `True` as demonstrated in the notebook.\
+1. Text extraction: Use the notebook `get-text-from-image.ipynb` to extract any text from the images. The text is directly translated into English. If the text should be further analysed, set the keyword `analyse_text` to `True` as demonstrated in the notebook.\
+**You can run this notebook on google colab: [Here](https://colab.research.google.com/github/ssciwr/misinformation/blob/main/notebooks/get-text-from-image.ipynb)**
+Place the data files and google cloud vision API key in your google drive to access the data.
-1. Facial analysis: Use the notebook `facial_expressions.ipynb` to identify if there are faces on the image, if they are wearing masks, and if they are not wearing masks also the race, gender and dominant emotion.
+1. Emotion recognition: Use the notebook `facial_expressions.ipynb` to identify if there are faces on the image, if they are wearing masks, and if they are not wearing masks also the race, gender and dominant emotion.
+**You can run this notebook on google colab: [Here](https://colab.research.google.com/github/ssciwr/misinformation/blob/main/notebooks/facial_expressions.ipynb)**
-Place the data files in your google drive to access the data.**
+Place the data files in your google drive to access the data.
 1. Content extraction: Use the notebook `image_summary.ipynb` to create captions for the images and ask questions about the image content.
+**You can run this notebook on google colab: [Here](https://colab.research.google.com/github/ssciwr/misinformation/blob/main/notebooks/image_summary.ipynb)**
 1. Multimodal content: Use the notebook `multimodal_search.ipynb` to find the best fitting images to an image or text query.
+**You can run this notebook on google colab: [Here](https://colab.research.google.com/github/ssciwr/misinformation/blob/main/notebooks/multimodal_search.ipynb)**
 1. Object analysis: Use the notebook `objects_expression.ipynb` to identify certain objects in the image. Currently, the following objects are being identified: person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, cell phone.
+**You can run this notebook on google colab: [Here](https://colab.research.google.com/github/ssciwr/misinformation/blob/main/notebooks/objects_expression.ipynb)**

-There are further notebooks that are currently of exploratory nature (`colors_expression.ipynb` to identify certain colors on the image).
+There are further notebooks that are currently of exploratory nature (`colors_expression.ipynb` to identify certain colors on the image). To crop social media posts use the `cropposts.ipynb` notebook.

 ## Features
 ### Text extraction
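All sample notebooks share the same basic loop, which the notebook hunks below keep adjusting. A condensed sketch of that loop (function names are taken from the notebook cells in this commit; the image-folder path is a placeholder, and `dump_df` is an assumed helper for the dataframe step, which this diff does not show):

import misinformation
from misinformation import utils as mutils

# collect the image files and create one sub-dictionary per image
images = mutils.find_files(path="/content/drive/MyDrive/misinformation-data/", limit=10)
mydict = mutils.initialize_dict(images)

# run a detector on every image, e.g. text extraction with further analysis
for key in mydict:
    mydict[key] = misinformation.text.TextDetector(
        mydict[key], analyse_text=True
    ).analyse_image()

# flatten the nested dict and export the results
outdict = mutils.append_data_to_dict(mydict)
df = mutils.dump_df(outdict)  # assumed helper; the conversion step is not shown in this diff
df.to_csv("./data_out.csv")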
@@ -83,7 +83,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "b37c0c91",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "mydict = mutils.initialize_dict(images)"
@@ -102,7 +104,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "992499ed-33f1-4425-ad5d-738cf565d175",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "mdisplay.explore_analysis(mydict, identify=\"faces\")"
@@ -120,7 +124,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "6f97c7d0",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "for key in mydict.keys():\n",
@@ -139,7 +145,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "604bd257",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "outdict = mutils.append_data_to_dict(mydict)\n",
@@ -158,7 +166,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "aa4b518a",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "df.head(10)"
@@ -176,7 +186,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "4618decb",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "df.to_csv(\"data/data_out.csv\")"
@@ -193,7 +205,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3 (ipykernel)",
+"display_name": "Python 3",
 "language": "python",
 "name": "python3"
 },
@@ -292,7 +292,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "e78646d6-80be-4d3e-8123-3360957bcaa8",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "df.head(10)"
@@ -310,16 +312,26 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "185f7dde-20dc-44d8-9ab0-de41f9b5734d",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "df.to_csv(\"./data_out.csv\")"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "2ef1132f-eb2a-43d7-be1f-69e879490f33",
+"metadata": {},
+"outputs": [],
+"source": []
+}
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3 (ipykernel)",
+"display_name": "Python 3",
 "language": "python",
 "name": "python3"
 },
@@ -72,7 +72,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "mdisplay.explore_analysis(mydict, identify=\"objects\")"
@@ -88,7 +90,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "for key in mydict:\n",
@@ -105,7 +109,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "outdict = mutils.append_data_to_dict(mydict)\n",
@@ -122,7 +128,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "df.head(10)"
@@ -138,16 +146,25 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "df.to_csv(\"./data_out.csv\")"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": []
+}
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3 (ipykernel)",
+"display_name": "Python 3",
 "language": "python",
 "name": "python3"
 },
@@ -17,7 +17,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "from misinformation import utils as mutils\n",
@@ -35,7 +37,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "images = mutils.find_files(\n",
@@ -47,7 +51,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "mydict = mutils.initialize_dict(images)"
@@ -70,18 +76,22 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "obj = sm.SummaryDetector(mydict)\n",
 "summary_model, summary_vis_processors = obj.load_model(\"base\")\n",
-"# summary_model, summary_vis_processors = mutils.load_model(\"large\")"
+"# summary_model, summary_vis_processors = obj.load_model(\"large\")"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "for key in mydict:\n",
@@ -121,7 +131,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "df.head(10)"
@@ -137,7 +149,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "df.to_csv(\"./data_out.csv\")"
@@ -159,7 +173,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "mdisplay.explore_analysis(mydict, identify=\"summary\")"
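The comment fix in the hunk above also documents the intended API: the larger summary model is loaded through the detector instance, not through `utils`. A minimal sketch of the corrected usage (the `summary` module import path behind the `sm` alias is an assumption; only `sm.SummaryDetector` and `load_model` appear in the diff):

from misinformation import utils as mutils
from misinformation import summary as sm  # assumed module path behind the "sm" alias

images = mutils.find_files(path="data", limit=10)
mydict = mutils.initialize_dict(images)

# load the captioning model once, then reuse it for all images
obj = sm.SummaryDetector(mydict)
summary_model, summary_vis_processors = obj.load_model("base")  # or "large"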
@@ -1,7 +1,6 @@
 {
 "cells": [
 {
-"attachments": {},
 "cell_type": "markdown",
 "id": "dcaa3da1",
 "metadata": {},
@@ -14,7 +13,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "f43f327c",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "# if running on google colab\n",
@@ -37,22 +38,23 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "cf362e60",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "import os\n",
 "from IPython.display import Image, display\n",
 "import misinformation\n",
 "from misinformation import utils as mutils\n",
-"from misinformation import display as mdisplay\n",
-"import tensorflow as tf"
+"from misinformation import display as mdisplay"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "id": "27675810",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "# download the models if they are not there yet\n",
@@ -64,35 +66,27 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "6da3a7aa",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "images = mutils.find_files(path=\"data\", limit=10)"
 ]
 },
-{
-"cell_type": "code",
-"execution_count": null,
-"id": "bf811ce0",
-"metadata": {},
-"outputs": [],
-"source": [
-"for i in images:\n",
-" display(Image(filename=i))"
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,
 "id": "8b32409f",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "mydict = mutils.initialize_dict(images)"
 ]
 },
 {
-"attachments": {},
 "cell_type": "markdown",
 "id": "7b8b929f",
 "metadata": {},
@@ -113,7 +107,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "7c6ecc88",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "mdisplay.explore_analysis(mydict, identify=\"text-on-image\")"
@@ -131,11 +127,12 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "for key in mydict:\n",
 " print(key)\n",
 " mydict[key] = misinformation.text.TextDetector(\n",
 " mydict[key], analyse_text=True\n",
 " ).analyse_image()"
@@ -153,7 +150,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "5709c2cd",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "outdict = mutils.append_data_to_dict(mydict)\n",
@@ -164,7 +163,9 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "c4f05637",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "# check the dataframe\n",
@@ -175,17 +176,27 @@
 "cell_type": "code",
 "execution_count": null,
 "id": "bf6c9ddb",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "# Write the csv\n",
 "df.to_csv(\"./data_out.csv\")"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "9012544e-f818-46ea-b087-3e150850a5d5",
+"metadata": {},
+"outputs": [],
+"source": []
+}
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3 (ipykernel)",
+"display_name": "Python 3",
 "language": "python",
 "name": "python3"
 },
@@ -199,7 +210,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.5"
+"version": "3.9.16"
 },
 "vscode": {
 "interpreter": {
@@ -122,12 +122,30 @@ def test_text_summary(get_path):
     ref_file = get_path + "example_summary.txt"
     with open(ref_file, "r", encoding="utf8") as file:
         reference_text = file.read()
-    test_obj.subdict["text_english"] = reference_text
+    mydict["text_english"] = reference_text
     test_obj.text_summary()
     reference_summary = " I’m sorry, but I don’t want to be an emperor. That’s not my business. I should like to help everyone - if possible - Jew, Gentile - black man - white . We all want to help one another. In this world there is room for everyone. The way of life can be free and beautiful, but we have lost the way ."
     assert mydict["summary_text"] == reference_summary


+def test_text_sentiment_transformers():
+    mydict = {}
+    test_obj = tt.TextDetector(mydict, analyse_text=True)
+    mydict["text_english"] = "I am happy that the CI is working again."
+    test_obj.text_sentiment_transformers()
+    assert mydict["sentiment"] == "POSITIVE"
+    assert mydict["sentiment_score"] == pytest.approx(0.99, 0.01)
+
+
+def test_text_ner():
+    mydict = {}
+    test_obj = tt.TextDetector(mydict, analyse_text=True)
+    mydict["text_english"] = "Bill Gates was born in Seattle."
+    test_obj.text_ner()
+    assert mydict["entity"] == ["Bill", "Gates", "Seattle"]
+    assert mydict["entity_type"] == ["I-PER", "I-PER", "I-LOC"]
+
+
 def test_PostprocessText(set_testdict, get_path):
     reference_dict = "THE\nALGEBRAIC\nEIGENVALUE\nPROBLEM\nDOM\nNVS TIO\nMINA\nMonographs\non Numerical Analysis\nJ.. H. WILKINSON"
     reference_df = "Mathematische Formelsammlung\nfür Ingenieure und Naturwissenschaftler\nMit zahlreichen Abbildungen und Rechenbeispielen\nund einer ausführlichen Integraltafel\n3., verbesserte Auflage"
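Both new tests exercise the pinned transformer models end to end, so they download the model weights on first run. They can be run selectively with, for example, `python -m pytest misinformation/test/test_text.py -k "sentiment_transformers or ner"` (the test-file path is assumed from the repository layout).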
@@ -122,26 +122,57 @@ class TextDetector(utils.AnalysisMethod):

     def text_summary(self):
         # use the transformers pipeline to summarize the text
-        pipe = pipeline("summarization")
+        # use the current default model - 03/2023
+        model_name = "sshleifer/distilbart-cnn-12-6"
+        model_revision = "a4f8f3e"
+        pipe = pipeline("summarization", model=model_name, revision=model_revision)
         self.subdict.update(pipe(self.subdict["text_english"])[0])

-    # def text_sentiment_transformers(self):
-    #     pipe = pipeline("text-classification")
+    def text_sentiment_transformers(self):
+        # use the transformers pipeline for text classification
+        # use the current default model - 03/2023
+        model_name = "distilbert-base-uncased-finetuned-sst-2-english"
+        model_revision = "af0f99b"
+        pipe = pipeline(
+            "text-classification", model=model_name, revision=model_revision
+        )
+        result = pipe(self.subdict["text_english"])
+        self.subdict["sentiment"] = result[0]["label"]
+        self.subdict["sentiment_score"] = result[0]["score"]
+
+    def text_ner(self):
+        # use the transformers pipeline for named entity recognition
+        # use the current default model - 03/2023
+        model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
+        model_revision = "f2482bf"
+        pipe = pipeline(
+            "token-classification", model=model_name, revision=model_revision
+        )
+        result = pipe(self.subdict["text_english"])
+        self.subdict["entity"] = []
+        self.subdict["entity_type"] = []
+        for entity in result:
+            self.subdict["entity"].append(entity["word"])
+            self.subdict["entity_type"].append(entity["entity"])


 class PostprocessText:
     def __init__(
-        self, mydict: dict = None, use_csv: bool = False, csv_path: str = None
+        self,
+        mydict: dict = None,
+        use_csv: bool = False,
+        csv_path: str = None,
+        analyze_text: str = "text_english",
     ) -> None:
         self.use_csv = use_csv
         if mydict:
             print("Reading data from dict.")
             self.mydict = mydict
-            self.list_text_english = self.get_text_dict()
+            self.list_text_english = self.get_text_dict(analyze_text)
         elif self.use_csv:
             print("Reading data from df.")
             self.df = pd.read_csv(csv_path, encoding="utf8")
-            self.list_text_english = self.get_text_df()
+            self.list_text_english = self.get_text_df(analyze_text)
         else:
             raise ValueError(
                 "Please provide either dictionary with textual data or \
@@ -177,24 +208,28 @@ class PostprocessText:
             most_frequent_topics.append(self.topic_model.get_topic(i))
         return self.topic_model, topic_df, most_frequent_topics

-    def get_text_dict(self):
-        # use dict to put text_english in list
+    def get_text_dict(self, analyze_text):
+        # use dict to put text_english or text_summary in list
         list_text_english = []
         for key in self.mydict.keys():
-            if "text_english" not in self.mydict[key]:
+            if analyze_text not in self.mydict[key]:
                 raise ValueError(
                     "Please check your provided dictionary - \
-                    no english text data found."
+                    no {} text data found.".format(
+                        analyze_text
+                    )
                 )
-            list_text_english.append(self.mydict[key]["text_english"])
+            list_text_english.append(self.mydict[key][analyze_text])
         return list_text_english

-    def get_text_df(self):
-        # use csv file to obtain dataframe and put text_english in list
-        # check that "text_english" is there
-        if "text_english" not in self.df:
+    def get_text_df(self, analyze_text):
+        # use csv file to obtain dataframe and put text_english or text_summary in list
+        # check that "text_english" or "text_summary" is there
+        if analyze_text not in self.df:
             raise ValueError(
                 "Please check your provided dataframe - \
-                no english text data found."
+                no {} text data found.".format(
+                    analyze_text
+                )
             )
-        return self.df["text_english"].tolist()
+        return self.df[analyze_text].tolist()
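Taken together, the new methods let the detector fill additional keys on a per-image dictionary, and `PostprocessText` can now aggregate any text column, not just `text_english`. A minimal sketch, assuming an already-translated `text_english` entry (constructor and key names as in the tests above):

from misinformation import text as tt

# sentiment and named entities on a single sub-dictionary
mydict = {"text_english": "Bill Gates was born in Seattle."}
detector = tt.TextDetector(mydict, analyse_text=True)
detector.text_sentiment_transformers()  # adds "sentiment" and "sentiment_score"
detector.text_ner()  # adds "entity" and "entity_type"

# post-processing can now target any text column, e.g. generated summaries
pp = tt.PostprocessText(
    mydict={"img1": {"summary_text": "A short summary."}},
    analyze_text="summary_text",
)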
8 notebooks/facial_expressions.ipynb (generated)
@@ -66,9 +66,11 @@
 "metadata": {},
 "outputs": [],
 "source": [
+"# Here you need to provide the path to your google drive folder\n",
+"# or local folder containing the images\n",
 "images = mutils.find_files(\n",
-" path=\"drive/MyDrive/misinformation-data/\",\n",
-" limit=1000,\n",
+" path=\"/content/drive/MyDrive/misinformation-data/\",\n",
+" limit=10,\n",
 ")"
 ]
 },
@@ -105,7 +107,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"mydict = mutils.initialize_dict(images[0:4])"
+"mydict = mutils.initialize_dict(images)"
 ]
 },
 {
40 notebooks/get-text-from-image.ipynb (generated)
@@ -40,13 +40,9 @@
 "outputs": [],
 "source": [
 "import os\n",
 "from IPython.display import Image, display\n",
 "import misinformation\n",
 "from misinformation import utils as mutils\n",
-"from misinformation import display as mdisplay\n",
-"import tensorflow as tf\n",
-"\n",
-"print(tf.config.list_physical_devices(\"GPU\"))"
+"from misinformation import display as mdisplay"
 ]
 },
 {
@@ -56,30 +52,12 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "# download the models if they are not there yet\n",
 "!python -m spacy download en_core_web_md\n",
 "!python -m textblob.download_corpora"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "id": "6da3a7aa",
 "metadata": {},
 "outputs": [],
 "source": [
-"images = mutils.find_files(path=\"../data/all/\", limit=1000)"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"id": "bf811ce0",
-"metadata": {},
-"outputs": [],
-"source": [
-"for i in images[0:3]:\n",
-" display(Image(filename=i))"
+"# Here you need to provide the path to your google drive folder\n",
+"# or local folder containing the images\n",
+"images = mutils.find_files(\n",
+" path=\"/content/drive/MyDrive/misinformation-data/\",\n",
+" limit=10,\n",
+")"
 ]
 },
 {
@@ -89,7 +67,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"mydict = mutils.initialize_dict(images[0:3])"
+"mydict = mutils.initialize_dict(images)"
 ]
 },
 {
@@ -110,7 +88,7 @@
 "source": [
 "os.environ[\n",
 " \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
-"] = \"../data/misinformation-campaign-981aa55a3b13.json\""
+"] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\""
 ]
 },
 {
41 notebooks/image_summary.ipynb (generated)
@@ -14,6 +14,28 @@
 "This notebooks shows some preliminary work on Image Captioning and Visual question answering with lavis. It is mainly meant to explore its capabilities and to decide on future research directions. We package our code into a `misinformation` package that is imported here:"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# if running on google colab\n",
+"# flake8-noqa-cell\n",
+"import os\n",
+"\n",
+"if \"google.colab\" in str(get_ipython()):\n",
+" # update python version\n",
+" # install setuptools\n",
+" !pip install setuptools==61 -qqq\n",
+" # install misinformation\n",
+" !pip install git+https://github.com/ssciwr/misinformation.git -qqq\n",
+" # mount google drive for data and API key\n",
+" from google.colab import drive\n",
+"\n",
+" drive.mount(\"/content/drive\")"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -43,9 +65,11 @@
 },
 "outputs": [],
 "source": [
+"# Here you need to provide the path to your google drive folder\n",
+"# or local folder containing the images\n",
 "images = mutils.find_files(\n",
-" path=\"../misinformation/test/data/\",\n",
-" limit=1000,\n",
+" path=\"/content/drive/MyDrive/misinformation-data/\",\n",
+" limit=10,\n",
 ")"
 ]
 },
@@ -57,18 +81,7 @@
 },
 "outputs": [],
 "source": [
-"mydict = mutils.initialize_dict(images[0:10])"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"tags": []
-},
-"outputs": [],
-"source": [
-"mydict"
+"mydict = mutils.initialize_dict(images)"
 ]
 },
 {
47 notebooks/multimodal_search.ipynb (generated)
@@ -16,6 +16,29 @@
 "This notebooks shows some preliminary work on Image Multimodal Search with lavis library. It is mainly meant to explore its capabilities and to decide on future research directions. We package our code into a `misinformation` package that is imported here:"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "0b0a6bdf",
+"metadata": {},
+"outputs": [],
+"source": [
+"# if running on google colab\n",
+"# flake8-noqa-cell\n",
+"import os\n",
+"\n",
+"if \"google.colab\" in str(get_ipython()):\n",
+" # update python version\n",
+" # install setuptools\n",
+" !pip install setuptools==61 -qqq\n",
+" # install misinformation\n",
+" !pip install git+https://github.com/ssciwr/misinformation.git -qqq\n",
+" # mount google drive for data and API key\n",
+" from google.colab import drive\n",
+"\n",
+" drive.mount(\"/content/drive\")"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -25,7 +48,7 @@
 },
 "outputs": [],
 "source": [
 "import misinformation\n",
+"import misinformation.utils as mutils\n",
 "import misinformation.multimodal_search as ms"
 ]
 },
@@ -46,9 +69,11 @@
 },
 "outputs": [],
 "source": [
-"images = misinformation.utils.find_files(\n",
-" path=\"../data/images/\",\n",
-" limit=1000,\n",
+"# Here you need to provide the path to your google drive folder\n",
+"# or local folder containing the images\n",
+"images = mutils.find_files(\n",
+" path=\"/content/drive/MyDrive/misinformation-data/\",\n",
+" limit=10,\n",
 ")"
 ]
 },
@@ -61,19 +86,7 @@
 },
 "outputs": [],
 "source": [
-"mydict = misinformation.utils.initialize_dict(images)"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"id": "d98b6227-886d-41b8-a377-896dd8ab3c2a",
-"metadata": {
-"tags": []
-},
-"outputs": [],
-"source": [
-"mydict"
+"mydict = mutils.initialize_dict(images)"
 ]
 },
 {
37 notebooks/objects_expression.ipynb (generated)
@@ -14,6 +14,28 @@
 "This notebooks shows some preliminary work on detecting objects expressions with cvlib. It is mainly meant to explore its capabilities and to decide on future research directions. We package our code into a `misinformation` package that is imported here:"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# if running on google colab\n",
+"# flake8-noqa-cell\n",
+"import os\n",
+"\n",
+"if \"google.colab\" in str(get_ipython()):\n",
+" # update python version\n",
+" # install setuptools\n",
+" !pip install setuptools==61 -qqq\n",
+" # install misinformation\n",
+" !pip install git+https://github.com/ssciwr/misinformation.git -qqq\n",
+" # mount google drive for data and API key\n",
+" from google.colab import drive\n",
+"\n",
+" drive.mount(\"/content/drive\")"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -39,9 +61,11 @@
 "metadata": {},
 "outputs": [],
 "source": [
+"# Here you need to provide the path to your google drive folder\n",
+"# or local folder containing the images\n",
 "images = mutils.find_files(\n",
-" path=\"../data/images-little-text/\",\n",
-" limit=1000,\n",
+" path=\"/content/drive/MyDrive/misinformation-data/\",\n",
+" limit=10,\n",
 ")"
 ]
 },
@@ -54,15 +78,6 @@
 "mydict = mutils.initialize_dict(images)"
 ]
 },
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"mydict"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},
28 requirements.txt (deleted)
@@ -1,28 +0,0 @@
-google-cloud-vision
-cvlib
-deepface<=0.0.75
-ipywidgets
-numpy<=1.23.4
-opencv_python
-pandas
-pooch
-protobuf
-retina_face
-setuptools
-tensorflow
-keras
-openpyxl
-pytest
-pytest-cov
-matplotlib
-opencv-contrib-python
-googletrans==3.1.0a0
-spacy
-https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.1/en_core_web_md-3.4.1.tar.gz
-jupyterlab
-spacytextblob
-textblob
-git+https://github.com/sloria/TextBlob.git@dev
-salesforce-lavis
-bertopic
-grpcio
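With requirements.txt removed, the notebook dependencies have to come in through the package installation itself, which matches the Dockerfile change at the top of this commit: the single remaining install step is `python -m pip install /opt/misinformation` (locally, `python -m pip install .` from the repository root).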