UNICRI-Decode-CW/topic-modelling.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1 - Topic modeling:\n",
    "\n",
    "This technique allows you for automated assessment of the text content and semantics. More reading: https://en.wikipedia.org/wiki/Topic_model\n",
    "You may use this for large scale screening to determine oddities in website contents, emails, tweets, discussion forums or even social networks.\n",
    "Each of the data sources requires specific dataminig approach. \n",
    "\n",
    "In this notebook, you will analyze data obtained from Twitter firehose api.\n",
    "https://developer.twitter.com/en/docs/twitter-api/enterprise/compliance-firehose-api/overview\n",
    "The advantage of working with this API is that you can request the access as a research or government body and get much more data, compared to privat API access (https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api.)\n",
    "\n",
    "However, you can use this Jupyter Notebook to process any texts you need.\n",
    "Depending on the context of your data colelction, you are able to spot forst 3 phases of the disinformation killchain just based on visualization of the topics.\n",
    "\n",
    "That is:\n",
    "Recon - When you see that certain topic suddenly resonates within the sampling space. When sampling is repeated to include the increments, there will be minor clusters around the initial structures.\n",
    "Build - Clusters will be larger and new entities will appear to interact. \n",
    "Seed - Similar cluster structures starts to appear in data from multiple sources.\n",
    "\n",
    "(Copy - Signifficant growth in cluster sizes and entity numbers per monitored info - space. Note that visibility of this phase depends on the method of sampling and may not occur if the sampling rate is too low)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use your own datsets if you modify the cell with the directory path below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#----\n",
    "directory=\"./dataset-kherson/kherson-11-2022/all-lang/\"\n",
    "#----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below is set up for you so that you do not have to change anything. \n",
    "Simply run each cell and see the output. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/73 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Consolidation of the files into one dataframe...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 73/73 [00:38<00:00,  1.91it/s]\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "import gzip\n",
    "import pandas as pd\n",
    "from tqdm import tqdm\n",
    "import os\n",
    "\n",
    "df_ac=pd.DataFrame()\n",
    "print(\"Consolidation of the files into one dataframe...\")\n",
    "for fil in tqdm(os.listdir(directory)):\n",
    "    flen=len(os.listdir(directory))\n",
    "    file=fil\n",
    "    filename=str(str(directory)+(os.fsdecode(fil)))\n",
    "    if filename.endswith(\".gz\"):\n",
    "        try:\n",
    "            df_act = pd.read_json(filename,lines=True)\n",
    "            df_ac=df_ac.append(df_act)\n",
    "        except:\n",
    "            continue\n",
    "        continue\n",
    "    else:\n",
    "        continue\n",
    "\n",
    "print(\"Dataframe generated!\")\n",
    "#print(\"Dataframe size is :\", df_ac.size)\n",
    "#print(df_ac.head())\n",
    "#print(df_ac.columns)\n",
    "#print(df_ac[[\"lang\"]].head(5))\n",
    "#print(df_ac[\"text\"].head(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset you are working with has also information about the language of the tweet. It is often off, but can be used as a raw determination of \"language space\" of the tweet.\n",
    "If you sort tweets but language, you can get insight to cultural / geopolitical context and also see the differences of the threat actor activity. \n",
    "You can see for yourself - the code was prerun and you can see the differences in the languages and the results are stored in html files.\n",
    "That means, you can open and view the files in any web browser.\n",
    "\n",
    "The dataset folder: \n",
    "\n",
    "\"/Documents/decode-cw/1-Topic_modeling/dataset-kherson/kherson-11-2022/all-lang/\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "lang\n",
       "en      232189\n",
       "fr       13072\n",
       "es       11864\n",
       "it        6394\n",
       "de        6094\n",
       "cy        3977\n",
       "pl        3609\n",
       "und       2749\n",
       "cs        2135\n",
       "nl        2113\n",
       "pt        2004\n",
       "uk        1555\n",
       "fi        1394\n",
       "tr        1261\n",
       "qme       1212\n",
       "dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Language statistics:\n",
    "\n",
    "df_ac[[\"lang\"]].value_counts().head(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can change the desired language of the tweets, or you can leave the default \"en\"  (as for \"English\") value in the code below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "LANG=\"en\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "lang\n",
       "en      232189\n",
       "dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Language filter:\n",
    "df_ac=df_ac.loc[df_ac[\"lang\"]==LANG]\n",
    "df_ac[[\"lang\"]].value_counts().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Corpus saving to the disc: \n",
    "df_ac['text'].to_csv(r'./dataset-kherson/kherson-11-2022/all-lang/corpora-en.txt', header=None, index=None, sep=' ', mode='a')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below does the actual topic modeling. Further reading with more examples below:\n",
    "\n",
    "https://spacy.io/universe/project/bertopic;\n",
    "https://datascience.stackexchange.com/questions/108178/how-to-prepare-texts-to-bert-roberta-models;\n",
    "https://albertauyeung.github.io/2020/06/19/bert-tokenization.html/;\n",
    "https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html;\n",
    "https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Timestamp('2022-11-09 00:58:38+0000', tz='UTC'), Timestamp('2022-11-09 00:14:54+0000', tz='UTC'), Timestamp('2022-11-11 00:11:39+0000', tz='UTC'), Timestamp('2022-11-11 00:13:17+0000', tz='UTC'), Timestamp('2022-11-11 00:13:53+0000', tz='UTC')]\n",
      "['rt kherson was is and always will be ukraine godspeed to the afu', 'rt alex debrincat to drake batherson starts this game off quickly way to get on the board sens gosensgo', 'rt the armed forces completely surrounded kherson', 'rt ukrainian forces entering kherson right now', 'rt a short timelapse of the progress by towards kherson over the past few days the changes only show the verified advanc']\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Batches: 100%|██████████| 7256/7256 [01:57<00:00, 61.79it/s] \n",
      "2022-12-01 16:29:05,240 - BERTopic - Transformed documents to Embeddings\n",
      "2022-12-01 16:38:04,504 - BERTopic - Reduced dimensionality\n",
      "2022-12-01 16:38:25,271 - BERTopic - Clustered reduced embeddings\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topic modeling finished, go to the ./dataset-kherson/kherson-11-2022/all-lang/ to inspect the results in html formate.\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "import pandas as pd\n",
    "from bertopic import BERTopic\n",
    "\n",
    "tweets=df_ac\n",
    "\n",
    "tweets.text = tweets.apply(lambda row: re.sub(r\"http\\S+\", \"\", row.text).lower(), 1)\n",
    "tweets.text = tweets.apply(lambda row: \" \".join(filter(lambda x:x[0]!=\"@\", row.text.split())), 1)\n",
    "tweets.text = tweets.apply(lambda row: \" \".join(re.sub(\"[^a-zA-Z]+\", \" \", row.text).split()), 1)\n",
    "timestamps = tweets.created_at.to_list()\n",
    "print(timestamps[:5])\n",
    "tweets = tweets.text.to_list()\n",
    "print(tweets[:5])\n",
    "\n",
    "# Create topics over time - only for the data with timestamps in standardised formate\n",
    "model = BERTopic(verbose=True)\n",
    "topics, probs = model.fit_transform(tweets)\n",
    "#topics_over_time = model.topics_over_time(tweets, topics, timestamps)\n",
    "#topics_over_time2 = model.topics_over_time(tweets, topics, timestamps, nr_bins=20)\n",
    "#topics_over_time2 = model.topics_over_time(tweets, topics, timestamps)\n",
    "\n",
    "#Save the model:\n",
    "#model.save(\"Bertopic_model\")\n",
    "#Load the model:\n",
    "#BERTopic.load(\"Bertopic_model\")\n",
    "\n",
    "#https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.visualize_topics\n",
    "#figx=model.visualize_topics_over_time(topics_over_time, topics=[9, 10, 72, 83, 87, 91]) #selection of particular topics\n",
    "#figx.show()\n",
    "#figx.write_html(directory+\"topics_over_time.html\")\n",
    "fig = model.visualize_topics()\n",
    "#fig.show()\n",
    "fig.write_html(directory+\"visualised_topics.html\")\n",
    "fig2=model.visualize_barchart()\n",
    "#fig2.show()\n",
    "fig2.write_html(directory+\"visualise_barchart.html\")\n",
    "fig3=model.visualize_hierarchy()\n",
    "#fig3.show()\n",
    "fig3.write_html(directory+\"visualise_hierarchy.html\")\n",
    "#fig4=model.visualize_topics_over_time(topics_over_time2, top_n_topics=25)\n",
    "#fig4.show()\n",
    "#fig4.write_html(directory+\"visualize_topics_over_time.html\")\n",
    "\n",
    "print(\"Topic modeling finished, go to the\"+\" \"+ directory+\" \"+ \"to inspect the results in html formate.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The files are too complex to be loaded here, especialy in case of larger datasets.\n",
    "You can open the outputs in the browser, go to this folder: \"/Documents/decode-cw/1-Topic_modeling/dataset-kherson/kherson-11-2022/all-lang\""
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3.8.8 ('eolas-py3.8.8')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "vscode": {
   "interpreter": {
    "hash": "7217508cf40a866d2c6d8c05c8a287a7af39b44bc942772df40ff0edc82b5da6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}