1 - Topic modeling:

This technique allows automated assessment of text content and semantics. More reading: https://en.wikipedia.org/wiki/Topic_model
You may use it for large-scale screening to spot oddities in website content, emails, tweets, discussion forums or even social networks.
Each of these data sources requires a specific data mining approach.

In this notebook, you will analyze data obtained from the Twitter firehose API.
https://developer.twitter.com/en/docs/twitter-api/enterprise/compliance-firehose-api/overview
The advantage of working with this API is that you can request access as a research or government body and get much more data than with private API access (https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api).

However, you can use this Jupyter Notebook to process any texts you need.
Depending on the context of your data collection, you are able to spot the first 3 phases of the disinformation kill chain just from the visualization of the topics.

That is:
Recon - A certain topic suddenly resonates within the sampling space. When sampling is repeated to include the increments, minor clusters appear around the initial structures.
Build - Clusters grow larger and new entities appear and start to interact.
Seed - Similar cluster structures start to appear in data from multiple sources.

(Copy - Significant growth in cluster sizes and entity numbers per monitored information space. Note that the visibility of this phase depends on the sampling method; it may not show up if the sampling rate is too low.)
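
The growth in cluster sizes described above can also be checked numerically rather than only by eye. The sketch below is a minimal illustration, not part of the original notebook: the function name `growing_topics`, the threshold `min_ratio` and the sample numbers are all hypothetical. It assumes both samples were assigned topics by the same fitted model, so that the topic IDs are comparable between samples.

```python
from typing import Dict, List, Tuple


def growing_topics(
    earlier: Dict[int, int],
    later: Dict[int, int],
    min_ratio: float = 2.0,
) -> List[Tuple[int, int, int]]:
    """Return (topic_id, earlier_size, later_size) for topics that grew
    by at least min_ratio between two samples, plus newly appeared topics."""
    flagged = []
    for topic_id, later_size in sorted(later.items()):
        earlier_size = earlier.get(topic_id, 0)
        # A topic absent from the earlier sample counts as newly appeared.
        if earlier_size == 0 or later_size / earlier_size >= min_ratio:
            flagged.append((topic_id, earlier_size, later_size))
    return flagged


# Hypothetical cluster sizes from two consecutive samples (illustrative numbers only).
sample_day_1 = {0: 120, 1: 40, 2: 15}
sample_day_2 = {0: 130, 1: 95, 2: 16, 3: 30}

print(growing_topics(sample_day_1, sample_day_2))
# [(1, 40, 95), (3, 0, 30)]
```

Here topic 1 more than doubled and topic 3 is new, so both are flagged; which ratio counts as "significant" depends on your sampling rate and has to be tuned per dataset.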



You can use your own datasets if you modify the cell with the directory path below:

In [1]:
#----
directory="./dataset-kherson/kherson-11-2022/all-lang/"
#----
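
If you bring your own data, it helps to know what the notebook expects in that directory: gzip-compressed files (`.gz`) with one JSON object per line, where each object has at least the `text`, `lang` and `created_at` fields used by the later cells. The sketch below writes and re-reads such a file using only the standard library. The records, the file name `sample.json.gz` and the temporary folder are made up for illustration; real firehose tweets carry many more fields.

```python
import gzip
import json
import os
import tempfile

# Hypothetical records; real firehose tweets carry many more fields.
records = [
    {"created_at": "2022-11-11T00:11:39+00:00", "lang": "en", "text": "example tweet one"},
    {"created_at": "2022-11-11T00:13:17+00:00", "lang": "uk", "text": "example tweet two"},
]

sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "sample.json.gz")

# One JSON object per line (JSON Lines), gzip-compressed.
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for record in records:
        handle.write(json.dumps(record) + "\n")

# Read the file back to confirm the layout.
with gzip.open(sample_path, "rt", encoding="utf-8") as handle:
    loaded = [json.loads(line) for line in handle]

print(loaded[0]["text"], len(loaded))
```

Keep the trailing slash in `directory`, because later cells build the output paths by plain string concatenation.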

The code below is set up for you so that you do not have to change anything. 
Simply run each cell and see the output. 

In [2]:
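import json
import gzip
import os

import pandas as pd
from tqdm import tqdm

print("Consolidation of the files into one dataframe...")
frames = []
for fil in tqdm(os.listdir(directory)):
    filename = os.path.join(directory, os.fsdecode(fil))
    # Only gzip-compressed JSON Lines files are loaded; everything else is skipped.
    if not filename.endswith(".gz"):
        continue
    try:
        frames.append(pd.read_json(filename, lines=True))
    except (ValueError, OSError, EOFError):
        # Skip files that are corrupted or cannot be parsed.
        continue

# DataFrame.append was removed in pandas 2.0, so the files are concatenated in one step.
df_ac = pd.concat(frames) if frames else pd.DataFrame()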

print("Dataframe generated!")
#print("Dataframe size is :", df_ac.size)
#print(df_ac.head())
#print(df_ac.columns)
#print(df_ac[["lang"]].head(5))
#print(df_ac["text"].head(2))

  0%|          | 0/73 [00:00<?, ?it/s]

Consolidation of the files into one dataframe...


100%|██████████| 73/73 [00:38<00:00,  1.91it/s]


The dataset you are working with also contains information about the language of each tweet. It is often inaccurate, but it can serve as a rough indication of the "language space" of the tweet.
If you sort tweets by language, you can gain insight into the cultural / geopolitical context and also see differences in threat actor activity.
You can see for yourself: the code was pre-run, so you can compare the results across languages. The results are stored in HTML files.
That means you can open and view the files in any web browser.

The dataset folder: 

"/Documents/decode-cw/1-Topic_modeling/dataset-kherson/kherson-11-2022/all-lang/"

In [3]:
# Language statistics:

df_ac[["lang"]].value_counts().head(15)

lang
en      232189
fr       13072
es       11864
it        6394
de        6094
cy        3977
pl        3609
und       2749
cs        2135
nl        2113
pt        2004
uk        1555
fi        1394
tr        1261
qme       1212
dtype: int64

You can change the desired language of the tweets, or you can leave the default value "en" (for English) in the code below:

In [4]:
LANG="en"

In [5]:
# Language filter:
df_ac=df_ac.loc[df_ac["lang"]==LANG]
df_ac[["lang"]].value_counts().head()

lang
en      232189
dtype: int64

In [None]:
# Save the corpus to disk (mode='a' appends if the file already exists):
df_ac['text'].to_csv(directory + 'corpora-' + LANG + '.txt', header=False, index=False, sep=' ', mode='a')

The code below does the actual topic modeling. Further reading with more examples below:

https://spacy.io/universe/project/bertopic
https://datascience.stackexchange.com/questions/108178/how-to-prepare-texts-to-bert-roberta-models
https://albertauyeung.github.io/2020/06/19/bert-tokenization.html/
https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html
https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-documents

In [6]:
import re
import pandas as pd
from bertopic import BERTopic

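# Work on a copy so that the filtered dataframe is not modified in place.
tweets = df_ac.copy()

# Remove URLs and lowercase the text.
tweets["text"] = tweets["text"].apply(lambda text: re.sub(r"http\S+", "", text).lower())
# Drop @mentions.
tweets["text"] = tweets["text"].apply(lambda text: " ".join(w for w in text.split() if not w.startswith("@")))
# Keep ASCII letters only. Note: this also strips non-Latin scripts (e.g. Cyrillic).
tweets["text"] = tweets["text"].apply(lambda text: " ".join(re.sub("[^a-zA-Z]+", " ", text).split()))
timestamps = tweets["created_at"].to_list()
print(timestamps[:5])
tweets = tweets["text"].to_list()
print(tweets[:5])

# Create topics over time - only for data with timestamps in a standardised format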
model = BERTopic(verbose=True)
topics, probs = model.fit_transform(tweets)
#topics_over_time = model.topics_over_time(tweets, topics, timestamps)
#topics_over_time2 = model.topics_over_time(tweets, topics, timestamps, nr_bins=20)
#topics_over_time2 = model.topics_over_time(tweets, topics, timestamps)

#Save the model:
#model.save("Bertopic_model")
#Load the model:
#BERTopic.load("Bertopic_model")

#https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.visualize_topics
#figx=model.visualize_topics_over_time(topics_over_time, topics=[9, 10, 72, 83, 87, 91]) #selection of particular topics
#figx.show()
#figx.write_html(directory+"topics_over_time.html")
fig = model.visualize_topics()
#fig.show()
fig.write_html(directory+"visualised_topics.html")
fig2=model.visualize_barchart()
#fig2.show()
fig2.write_html(directory+"visualise_barchart.html")
fig3=model.visualize_hierarchy()
#fig3.show()
fig3.write_html(directory+"visualise_hierarchy.html")
#fig4=model.visualize_topics_over_time(topics_over_time2, top_n_topics=25)
#fig4.show()
#fig4.write_html(directory+"visualize_topics_over_time.html")

print("Topic modeling finished, go to " + directory + " to inspect the results in HTML format.")

[Timestamp('2022-11-09 00:58:38+0000', tz='UTC'), Timestamp('2022-11-09 00:14:54+0000', tz='UTC'), Timestamp('2022-11-11 00:11:39+0000', tz='UTC'), Timestamp('2022-11-11 00:13:17+0000', tz='UTC'), Timestamp('2022-11-11 00:13:53+0000', tz='UTC')]
['rt kherson was is and always will be ukraine godspeed to the afu', 'rt alex debrincat to drake batherson starts this game off quickly way to get on the board sens gosensgo', 'rt the armed forces completely surrounded kherson', 'rt ukrainian forces entering kherson right now', 'rt a short timelapse of the progress by towards kherson over the past few days the changes only show the verified advanc']


Batches: 100%|██████████| 7256/7256 [01:57<00:00, 61.79it/s] 
2022-12-01 16:29:05,240 - BERTopic - Transformed documents to Embeddings
2022-12-01 16:38:04,504 - BERTopic - Reduced dimensionality
2022-12-01 16:38:25,271 - BERTopic - Clustered reduced embeddings


Topic modeling finished, go to ./dataset-kherson/kherson-11-2022/all-lang/ to inspect the results in HTML format.
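
To see what the three cleaning steps in the cell above do to a tweet before it reaches BERTopic, you can try them on single strings. The sketch below wraps the same regular expressions in a helper; the name `clean_tweet` and the example strings are made up for illustration and are not part of the notebook.

```python
import re


def clean_tweet(text: str) -> str:
    """Apply the same three cleaning steps to a single string."""
    # 1. Remove URLs and lowercase.
    text = re.sub(r"http\S+", "", text).lower()
    # 2. Drop tokens that start with "@" (mentions).
    text = " ".join(word for word in text.split() if not word.startswith("@"))
    # 3. Replace every run of characters other than a-z/A-Z with a space.
    return " ".join(re.sub(r"[^a-zA-Z]+", " ", text).split())


print(clean_tweet("RT @user: Kherson is free! https://t.co/abc123 #Ukraine"))
# rt kherson is free ukraine
print(repr(clean_tweet("Херсон це Україна")))
# ''
```

The second example shows a side effect worth keeping in mind: the last step keeps ASCII letters only, so text in non-Latin scripts is reduced to an empty string. If you set LANG to a language such as "uk", you would likely need to relax that pattern first.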


The files are too complex to be loaded here, especially in the case of larger datasets.
You can open the outputs in a browser; go to this folder: "/Documents/decode-cw/1-Topic_modeling/dataset-kherson/kherson-11-2022/all-lang"
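
If you prefer not to browse to the folder by hand, the reports can also be located and opened from Python. This is a minimal sketch using only the standard library; the helper names `list_reports` and `open_reports` are hypothetical, and the demonstration runs on a temporary folder with empty placeholder files rather than on the real outputs.

```python
import tempfile
import webbrowser
from pathlib import Path
from typing import List


def list_reports(folder: str) -> List[Path]:
    """Return the HTML files in folder, sorted by name."""
    return sorted(Path(folder).glob("*.html"))


def open_reports(folder: str) -> None:
    """Open every HTML report in the default web browser."""
    for report in list_reports(folder):
        webbrowser.open(report.resolve().as_uri())


# Demonstration on a temporary folder with empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("visualised_topics.html", "visualise_barchart.html", "notes.txt"):
    (Path(demo_dir) / name).touch()

print([report.name for report in list_reports(demo_dir)])
# ['visualise_barchart.html', 'visualised_topics.html']
```

In the notebook you would call `open_reports(directory)` after the topic modeling cell has finished; each report then opens in a separate browser tab.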