This commit is contained in:
Eolas 2023-01-13 11:58:03 +01:00 committed by GitHub
Commit 39b5c7f66f
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
4 changed files with 2196 additions and 0 deletions

503
2-Named-Entity-Recognition.ipynb Normal file

@@ -0,0 +1,503 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2 - Named Entity Recognition:\n",
"\n",
"This is just a raw example, how the entity recogniton works.\n",
"It uses the pre-trained language based models, based on news, wikipedia or website data. The initial creation of such model is expensive on computational resources. \n",
"See this lik for more details: https://spacy.io/models\n",
"Note that this techique is language sensitive and you have to know in advance, what is the language of your dataset. Our training example is in English.\n",
"\n",
"This is the link to Ukrainian model that you may use from the beginning and add additional precision by trianing it on the field data: https://spacy.io/models/uk\n",
"\n",
"More reading: https://towardsdatascience.com/named-entity-recognition-ner-using-spacy-nlp-part-4-28da2ece57c6"
]
},
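{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building the full pipeline, here is a minimal sketch of what spaCy NER returns for a single sentence. It assumes the \"en_core_web_sm\" model is already installed; the example sentence is made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal NER sketch on one sentence (assumes en_core_web_sm is installed).\n",
"import spacy\n",
"\n",
"nlp = spacy.load(\"en_core_web_sm\")\n",
"doc = nlp(\"Volodymyr Zelensky met EU officials in Kyiv on Monday.\")\n",
"for ent in doc.ents:\n",
"    print(ent.text, ent.start_char, ent.end_char, ent.label_)"
]
},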
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, set up the proper directory:\n",
"(You don't have to change the line below if you want to process the default dataset.)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"#----\n",
"directory=\"./dataset-zelenskywarcriminal/all-lang/\"\n",
"#----"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 3/3 [00:00<00:00, 74.57it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Consolidation of the files into one dataframe...\n",
"Dataframe generated!\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"import json\n",
"import gzip\n",
"import pandas as pd\n",
"from tqdm import tqdm\n",
"import os\n",
"\n",
"\n",
"df_ac=pd.DataFrame()\n",
"print(\"Consolidation of the files into one dataframe...\")\n",
"\n",
"for fil in tqdm(os.listdir(directory)):\n",
" flen=len(os.listdir(directory))\n",
" file=fil\n",
" #print(file)\n",
" filename=str(str(directory)+(os.fsdecode(fil)))\n",
" if filename.endswith(\".gz\"):\n",
" try:\n",
" df_act = pd.read_json(filename,lines=True)\n",
" df_ac=df_ac.append(df_act)\n",
" except:\n",
" continue\n",
" \n",
" continue\n",
" else:\n",
" continue\n",
"\n",
"print(\"Dataframe generated!\")\n",
"#print(df_ac.head())\n",
"#print(df_ac.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the counts of languages, present in the dataset:\n",
"\n",
"For next processing, we chose to work with English only.\n",
"In the field you will need to use multiple language models, or good translator. In case of the latter, I recommend the Deepl AI.\n",
"https://www.deepl.com/en/docs-api Compared to Google Translate (as of 5-12-2022) Deepl seems to have better understanding for semantics and provides good API to work with."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"lang\n",
"en 102\n",
"fr 45\n",
"it 16\n",
"es 12\n",
"qme 10\n",
"dtype: int64"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#print(df_ac[[\"lang\"]].head(5))\n",
"#print(df_ac[\"text\"].head(2))\n",
"df_ac[[\"lang\"]].value_counts().head()"
]
},
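{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the non-English tweets, the sketch below shows how the DeepL API mentioned above could be used via the official deepl Python package. The package install, the placeholder auth key and the sample sentence are assumptions, not part of the original workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: translate a non-English tweet with the DeepL API.\n",
"# YOUR_DEEPL_AUTH_KEY is a placeholder; requires the deepl package (pip install deepl).\n",
"import deepl\n",
"\n",
"translator = deepl.Translator(\"YOUR_DEEPL_AUTH_KEY\")\n",
"sample = \"Cherson est de nouveau sous contrôle ukrainien.\"\n",
"result = translator.translate_text(sample, target_lang=\"EN-US\")\n",
"print(result.text)"
]
},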
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Choose the language if interest, or keep the current preset for English:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"LANG='en'"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"lang\n",
"en 102\n",
"dtype: int64"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Filter only the language with available model in spacy\n",
"df_ac=df_ac.loc[df_ac[\"lang\"]==LANG]\n",
"df_ac[[\"lang\"]].value_counts().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Codename for the English language model: \"en_core_web_sm\"\n",
" (See this link to find more models: https://spacy.io/models)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"LANG_MODEL_CODE='en_core_web_sm'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the code below to extract the Named entities, sentiment and keywords from the text:\n",
"\n",
"The output is both CSV and \"pickled\" dataframe. Since the tweets contains interpunction, that breaks the usual CSV formatting, the pickled output is also inlcuded. You can use this \"pickled\" file as an input for further analysis and the formatting will work.\n",
"Both output files will be saved to the same directory, that keeps this notebook."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded slice of dataframe with text.\n",
"Loaded keywords generated from the text column.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"<ipython-input-63-2d83fc9cbbb3>:44: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" df_ac1['text_keywords']=df_ac1['text'].apply(text_keywords_gen)\n",
"<ipython-input-63-2d83fc9cbbb3>:46: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" df_ac1['sentiment']=df_ac1['text'].apply(sentiment_detector)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sentiment detection finished.\n",
"NER - Named Entity Recognition finished.\n",
"Done!\n",
"Saving df to pickle..\n",
"Df pickled!\n",
"Saving pickled DF to csv...\n",
"df_ac saved to csv!\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"<ipython-input-63-2d83fc9cbbb3>:48: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" df_ac1['ner_detection']=df_ac1['text'].apply(ner_extract)\n"
]
}
],
"source": [
"def text_keywords_gen(user_input): \n",
" import requests\n",
" import bs4\n",
" from nltk.tokenize import sent_tokenize\n",
" from gensim.summarization import summarize\n",
" from gensim.summarization import keywords\n",
" #text= input_texter(user_input)\n",
" text=user_input\n",
" #print(\"\\nKeywords: \")\n",
" #print(keywords(user_input).split(\"\\n\"))\n",
" kwrds=keywords(user_input).split(\"\\n\")\n",
" return kwrds\n",
"\n",
"def sentiment_detector(span):\n",
" import spacy\n",
" from spacytextblob.spacytextblob import SpacyTextBlob\n",
" nlp=spacy.load(LANG_MODEL_CODE)\n",
" nlp.add_pipe('spacytextblob')\n",
" doc=nlp(span)\n",
" sentiment={\"polarity\":doc._.polarity,\"subjectivity\":doc._.subjectivity,\"assessments\":doc._.assessments}\n",
" #print(str(sentiment))\n",
" return sentiment\n",
"\n",
"def ner_extract(sentence):\n",
" #Named entity recognition in a span / sentence.\n",
" import spacy\n",
" nlp=spacy.load(\"en_core_web_lg\")\n",
" doc=nlp(sentence)\n",
" entities=[]\n",
" for ent in doc.ents:\n",
" #print(ent.text, ent.start_char, ent.end_char, ent.label_)\n",
" entities.append([ent.text, ent.start_char, ent.end_char, ent.label_])\n",
" return entities\n",
"\n",
"def pickled_df_to_csv(infile_pickle,outfile_csv):\n",
" import pandas as pd\n",
" df=pd.read_pickle(infile_pickle)\n",
" df.to_csv(outfile_csv)\n",
" #print(\"CSV file saved to:\" +\" \"+outfile_csv)\n",
"\n",
"df_ac1=df_ac[['id','text',\"lang\"]]\n",
"print(\"Loaded slice of dataframe with text.\")\n",
"df_ac[\"lang_model\"]=df_ac[\"lang\"].apply(spacy_lang_assign)\n",
"df_ac1['text_keywords']=df_ac1['text'].apply(text_keywords_gen)\n",
"print(\"Loaded keywords generated from the text column.\")\n",
"df_ac1['sentiment']=df_ac1['text'].apply(sentiment_detector)\n",
"print('Sentiment detection finished.')\n",
"df_ac1['ner_detection']=df_ac1['text'].apply(ner_extract)\n",
"print(\"NER - Named Entity Recognition finished.\")\n",
"print(\"Done!\")\n",
"#print(df_ac1.head(3))\n",
"\n",
"print(\"Saving df to pickle..\")\n",
"infilef=directory+\"/pickled-twitter-dump-df-keywords\"\n",
"infilecsv=directory+\"/df_ac-in-csv.csv\"\n",
"df_ac.to_pickle(infilef,compression=\"infer\",protocol=4)\n",
"print(\"Df pickled!\")\n",
"print(\"Saving pickled DF to csv...\")\n",
"pickled_df_to_csv(infilef,infilecsv)\n",
"print(\"df_ac saved to csv!\")\n",
"\n"
]
},
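{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick verification, the sketch below reloads the pickled dataframe saved by the cell above (it assumes that cell finished successfully and that the directory variable is unchanged):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: reload the pickled dataframe produced by the previous cell.\n",
"import pandas as pd\n",
"\n",
"df_check = pd.read_pickle(directory+\"/pickled-twitter-dump-df-keywords\")\n",
"print(df_check.shape)\n",
"print(df_check.columns.tolist())"
]
},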
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: In case of trouble, run also code below. It may occur on some systems, that you need to fix the environment in order to run such comptutational heavy operations like Named entity recognition.\n",
"In provided Virtual Machine the error should be fixed."
]
},
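{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged environment-fix sketch: install the libraries and language data used in this notebook.\n",
"# Package names and versions are assumptions based on the imports above; run only if the cells\n",
"# above fail with missing-module, missing-model or missing-corpus errors.\n",
"import sys\n",
"!{sys.executable} -m pip install spacy spacytextblob nltk \"gensim<4.0\"\n",
"!{sys.executable} -m spacy download en_core_web_sm\n",
"!{sys.executable} -m spacy download en_core_web_lg\n",
"!{sys.executable} -m textblob.download_corpora\n",
"import nltk\n",
"nltk.download(\"punkt\")"
]
},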
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See the top 10 results for quick check:\n",
"\n",
"(For more detalis on entity types, reffer to this link: https://spacy.io/usage/linguistic-features#named-entities )"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "unhashable type: 'list'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.map_locations\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mTypeError\u001b[0m: unhashable type: 'list'"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'\n",
"Traceback (most recent call last):\n",
" File \"pandas/_libs/hashtable_class_helper.pxi\", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations\n",
"TypeError: unhashable type: 'list'\n"
]
},
{
"data": {
"text/plain": [
"[american, 57, 65, NORP] 3\n",
"[RT @TimBronson5:, 0, 16, ORG] 3\n",
"[RT @NoMoreNATO:, 0, 15, ORG] 3\n",
"[RT @Stop_This_Evil, 0, 18, ORG] 2\n",
"[Modi, 22, 26, PERSON] 2\n",
"[Nazis, 60, 65, NORP] 2\n",
"[@vonderleyen, 0, 12, ORG] 2\n",
"[ZelenskyWarCriminal, 19, 38, NORP] 2\n",
"[ZelenskyWarCriminal, 1, 20, ORG] 2\n",
"[RT, 0, 2, ORG] 2\n",
"[RT @NTY57NTY, 0, 12, ORG] 2\n",
"[@GRDecter, 0, 9, PERSON] 1\n",
"[Ukrainian, 22, 31, NORP] 1\n",
"[Jewish, 53, 59, NORP] 1\n",
"[RT @ianbremmer, 0, 14, ORG] 1\n",
"[@ZelenskyyUa @RishiSunak, 0, 24, ORG] 1\n",
"[RT @Rich_Gally_:, 0, 16, ORG] 1\n",
"[Nigel, 19, 24, PERSON] 1\n",
"[millions, 100, 108, CARDINAL] 1\n",
"[@ZelenskyyUa Be, 14, 29, ORG] 1\n",
"Name: ner_detection, dtype: int64"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ac1['ner_detection'].str[0].value_counts().head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Keywords results:\n",
"\n",
"There are several ways of how to determine what is the keyword in particular text. This method is based on library called \"gensim\" and one of the advanatges is that it is language agnostic.\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 29\n",
"zelenskywarcriminal 6\n",
"zelensky 4\n",
"times 3\n",
"western 2\n",
"eur 2\n",
"Name: text_keywords, dtype: int64"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ac1['text_keywords'].str[0].value_counts().head(6)"
]
},
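{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a standalone sketch of the gensim keyword extraction used earlier. It requires gensim < 4.0 (gensim.summarization was removed in 4.x); the sample text is made up, and very short texts may return few or no keywords."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Standalone keyword-extraction sketch (gensim < 4.0 only).\n",
"from gensim.summarization import keywords\n",
"\n",
"sample = (\"Kherson was liberated and Ukrainian forces entered the city. \"\n",
"          \"Residents of Kherson celebrated in the streets as Ukrainian flags were raised.\")\n",
"print(keywords(sample).split(\"\\n\"))"
]
},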
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sentiment analysis:\n",
"\n",
"The method uses combination of rules and pre-trained ML model to determine sentiment from so called \"assessment words\". Since the tweets do not always contain such words, sometimes it is not possible to detect the sentiment properly and the value remains empty.\n",
"Mode details here: https://spacy.io/universe/project/spacy-textblob\n"
]
},
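{
"cell_type": "markdown",
"metadata": {},
"source": [
"A single-sentence sketch of the spacytextblob pipeline described above; it mirrors the sentiment_detector function and assumes the \"en_core_web_sm\" model and spacytextblob are installed. The example sentence is made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Single-sentence sentiment sketch (mirrors sentiment_detector above).\n",
"import spacy\n",
"from spacytextblob.spacytextblob import SpacyTextBlob\n",
"\n",
"nlp = spacy.load(\"en_core_web_sm\")\n",
"nlp.add_pipe(\"spacytextblob\")\n",
"doc = nlp(\"What a wonderful, inspiring speech - truly great news.\")\n",
"print(doc._.polarity, doc._.subjectivity)\n",
"print(doc._.assessments)"
]
},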
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Series([], Name: sentiment, dtype: int64)"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ac1['sentiment'].str[0].value_counts().head(20)"
]
}
],
"metadata": {
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
},
"vscode": {
"interpreter": {
"hash": "7217508cf40a866d2c6d8c05c8a287a7af39b44bc942772df40ff0edc82b5da6"
}
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 1,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}

298
5-Email-data-mining.ipynb Normal file

@@ -0,0 +1,298 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"5 - Email analysis:\n",
"Below are code examples that will help you parse the single email message file, or whole directory with malicous emails and exctracts various medatata for further use.\n",
"It can be combined with previous codes e.g to exctrat named entities, sentiment or even draw a graph."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Directory and exmaple file selection:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"#----\n",
"directory=\"./emails/\"\n",
"filename=directory+'1.eml'\n",
"#----"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Working example for mailparser library, that you can use for your for further work:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] [('Оргкомитет', 'news@hrsummit.ru')] <class 'list'>\n",
"[2] []\n",
"[3] set()\n",
"[4] [('', 'karasek.jindrich@gmail.com')]\n",
"[5] ['gmail.com']\n",
"[6] [{'by': 'smtp368.emlone.com', 'id': 'hed7s62erm86', 'for': '<karasek.jindrich@gmail.com>', 'envelope_from': 'postman3496981@justeml.com', 'date': 'Tue, 15 Nov 2022 08:37:23 +0000 envelope-from <postman3496981@justeml.com>', 'hop': 1, 'date_utc': '2022-11-15T08:37:23', 'delay': 0}, {'from': 'smtp573.emlone.com smtp573.emlone.com. 87.246.187.152', 'by': 'mx.google.com', 'with': 'ESMTPS', 'id': 'i1-20020a2ea361000000b0026fbd8bb585si5659533ljn.64.2022.11.15.00.37.23', 'for': '<karasek.jindrich@gmail.com> version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128', 'date': 'Tue, 15 Nov 2022 00:37:23 -0800 PST', 'hop': 2, 'date_utc': '2022-11-15T08:37:23', 'delay': 0.0}, {'by': '2002:a05:6638:59c:b0:375:9b14:11af', 'with': 'SMTP', 'id': 'a28csp3262577jar', 'date': 'Tue, 15 Nov 2022 00:37:24 -0800 PST', 'hop': 3, 'date_utc': '2022-11-15T08:37:24', 'delay': 1.0}]\n",
"[7] None\n",
"[8] [('', 'karasek.jindrich@gmail.com')]\n",
"[9] []\n",
"[10] Один день до CDO/CDTO Summit & Award 17 НОЯБРЯ 2022\n",
"[11] []\n",
"[16] <E1ourRb-P0bP3O-NK@ucs301-ucs-6.msgpanel.com>\n",
"[19] +0.0\n"
]
}
],
"source": [
"import mailparser\n",
"\n",
"#https://pypi.org/project/mail-parser/\n",
"\"\"\"\n",
"mail = mailparser.parse_from_bytes(byte_mail)\n",
"mail = mailparser.parse_from_file(f)\n",
"mail = mailparser.parse_from_file_msg(outlook_mail)\n",
"mail = mailparser.parse_from_file_obj(fp)\n",
"mail = mailparser.parse_from_string(raw_mail)\n",
"\n",
"mail.attachments: list of all attachments\n",
"mail.body\n",
"mail.date: datetime object in UTC\n",
"mail.defects: defect RFC not compliance\n",
"mail.defects_categories: only defects categories\n",
"mail.delivered_to\n",
"mail.from_\n",
"mail.get_server_ipaddress(trust=\"my_server_mail_trust\")\n",
"mail.headers\n",
"mail.mail: tokenized mail in a object\n",
"mail.message: email.message.Message object\n",
"mail.message_as_string: message as string\n",
"mail.message_id\n",
"mail.received\n",
"mail.subject\n",
"mail.text_plain: only text plain mail parts in a list\n",
"mail.text_html: only text html mail parts in a list\n",
"mail.text_not_managed: all not managed text (check the warning logs to find content subtype)\n",
"mail.to\n",
"mail.to_domains\n",
"mail.timezone: returns the timezone, offset from UTC\n",
"mail.mail_partial: returns only the mains parts of emails\n",
" \n",
"#Write attachments on disc\n",
"mail.write_attachments(base_path)\n",
"\"\"\"\n",
"\n",
"\n",
"raw_email=filename\n",
"\n",
"mail = mailparser.parse_from_file(raw_email)\n",
"\n",
"print(\"[1]\",mail.from_,type(mail.from_))\n",
"print(\"[2]\",mail.defects)\n",
"print(\"[3]\",mail.defects_categories)\n",
"print(\"[4]\",mail.to)\n",
"print(\"[5]\",mail.to_domains)\n",
"print(\"[6]\",mail.received)\n",
"print(\"[7]\",mail.get_server_ipaddress(trust=\"my_server_mail_trust\"))\n",
"print(\"[8]\",mail.delivered_to)\n",
"print(\"[9]\",mail.attachments)\n",
"print(\"[10]\",mail.subject)\n",
"print(\"[11]\",mail.text_not_managed)\n",
"#print(\"[12]\",mail.headers)\n",
"#print(\"[13]\",mail.mail_partial)\n",
"#print(\"[14]\",mail.text_plain) #works\n",
"#print(\"[15]\",mail.text_html) #works\n",
"print(\"[16]\",mail.message_id)\n",
"#print(\"[17]\",mail.message_as_string)\n",
"#print(\"[18]\",mail.body)\n",
"print(\"[19]\",mail.timezone)"
]
},
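{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, a hedged sketch that parses the same file with Python's standard library email module only, without any third-party dependency (it assumes the filename variable from above still points to the example .eml file):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: parse the example message with the standard library only.\n",
"from email import policy\n",
"from email.parser import BytesParser\n",
"\n",
"with open(filename, \"rb\") as fh:\n",
"    msg = BytesParser(policy=policy.default).parse(fh)\n",
"\n",
"print(msg[\"From\"], \"->\", msg[\"To\"])\n",
"print(\"Subject:\", msg[\"Subject\"])\n",
"body = msg.get_body(preferencelist=(\"plain\",))\n",
"print(body.get_content()[:200] if body is not None else \"<no plain-text part>\")"
]
},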
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python functions ready to be incorporated to the analytical framework:\n",
"\"email_miner_text\" is applied to each email file to extract the metadata for the hunting.\n",
"\"email_dir_miner\" is an alternative of above function, in case you have already a pile of eml. files to process and also generates the graph of sender- receiver structure to identify the important topological structures in chain mail campaigns."
]
},
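{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of how the sender-receiver edge list written by email_dir_miner (defined in the next cell) could be turned into a graph. Run the next cell first so that edf.csv exists; networkx is assumed to be installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: build a directed sender -> receiver graph from the edf.csv\n",
"# edge list written by email_dir_miner() below (run that cell first).\n",
"import pandas as pd\n",
"import networkx as nx\n",
"\n",
"edges = pd.read_csv(directory + \"edf.csv\")[[\"from\", \"mailto\"]].dropna()\n",
"G = nx.from_pandas_edgelist(edges, source=\"from\", target=\"mailto\", create_using=nx.DiGraph())\n",
"# Nodes with the highest degree are the hubs of the mailing structure.\n",
"print(sorted(G.degree, key=lambda x: x[1], reverse=True)[:5])"
]
},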
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[*]Duplicates dropped.\n",
" from \\\n",
"count 20 \n",
"unique 16 \n",
"top [('LOGYTalks - Blockchain & Cryptocurrency Sum... \n",
"freq 2 \n",
"\n",
" mailto \n",
"count 20 \n",
"unique 9 \n",
"top [('', 'karasek.jindrich@gmail.com')] \n",
"freq 8 \n",
"[*] CSV file saved, done!\n"
]
}
],
"source": [
"def email_miner_text(email_file): # takes single eml file and mines the metadata for the analysis.\n",
" import datetime\n",
" import json\n",
" import eml_parser\n",
" import numpy as np\n",
" import pandas as pd\n",
" import os\n",
" import mailparser\n",
" #https://pypi.org/project/mail-parser/\n",
" \"\"\"Dataframe design\"\"\"\n",
" lcolumns=[\"Filename\",\"messageID\",\"from\",\"subject\",\"mailto\",\"mailtod\",\"received\",\"attachments\",\"text-html\",\"text\"]\n",
" ledf=pd.DataFrame(columns=lcolumns)\n",
" try:\n",
" with open(email_file,'rb') as fhdl:\n",
" raw_email=fhdl.read()\n",
" #print(filename)\n",
" #all the code for edf filling comes here\n",
" mail = mailparser.parse_from_file(raw_email)\n",
" data=np.array([str(file),str(mail.message_id),str(mail.from_),str(mail.subject),str(mail.to),str(mail.to_domains),str(mail.received),str(mail.attachments),str(mail.text_html),str(mail.text)])\n",
" #row=pd.Series(data,index=lcolumns)\n",
" #ledf=ledf.append(pd.Series(row),ignore_index=True)\n",
" except:\n",
" pass\n",
" #print(ledf)\n",
" #print(ledf.describe())\n",
" #print(ledf[\"subject\"],ledf[\"Filename\"])\n",
" #deduplicate the edgelist\n",
" #ledf.drop_duplicates(keep='first')\n",
" #save the edgelists to a csv files\n",
" #ledf.to_csv(\"/Users/jindrich_karasek/data-science/disinfo-corpus/maily/ledf.csv\", sep=',', encoding='utf-8')\n",
" #print(mail.message_as_string)\n",
" #print(mail.text_plain)\n",
" return mail.text_plain\n",
"\n",
"def email_dir_miner(path): #Generates the matrix for the graph of the sending emails structure.\n",
" \"\"\"Read ALL directory,TOP DOWN from starting point, select all the *.eml files and parse metadata into the dataframe\"\"\"\n",
" import datetime\n",
" import json\n",
" import eml_parser\n",
" import numpy as np\n",
" import pandas as pd\n",
" import os\n",
" import mailparser\n",
" import tqdm\n",
"\n",
" #https://pypi.org/project/mail-parser/\n",
" \"\"\"Dataframe design\"\"\"\n",
" #lcolumns=[\"Filename\",\"messageID\",\"from\",\"subject\",\"mailto\",\"mailtod\",\"received\",\"attachments\",\"text-html\",\"text\"]\n",
" lcolumns=[\"from\",\"mailto\"]\n",
" ledf=pd.DataFrame(columns=lcolumns)\n",
" \"\"\"iterate over the files and add pd series generated from each into the final datatframe\"\"\"\n",
" #https://www.bogotobogo.com/python/python_traversing_directory_tree_recursively_os_walk.php\n",
" \n",
" i=0\n",
" for pth,subdirs,files in os.walk(path):\n",
" for name in files:\n",
" file=os.path.join(pth,name)\n",
" #print(file)\n",
" filename=str(str(path)+(os.fsdecode(file)))\n",
" #print(file)\n",
" if file.endswith(\".eml\"):\n",
" i=i+1\n",
" #print(i)\n",
" try:\n",
" with open(file,'rb') as fhdl:\n",
" raw_email=fhdl.read()\n",
" #print(filename)\n",
" #all the code for edf filling comes here\n",
" mail = mailparser.parse_from_bytes(raw_email)\n",
" #data=np.array([str(file),str(mail.message_id),str(mail.from_),str(mail.subject),str(mail.to),str(mail.to_domains),str(mail.received),str(mail.attachments),str(mail.text_html)])\n",
" data={\"Filename\":str(file),\"messageID\":str(mail.message_id),\"from\":str(mail.from_),\"subject\":str(mail.subject),\"mailto\":str(mail.to),\"mailtod\":str(mail.to_domains),\"received\":str(mail.received),\"attachments\":str(mail.attachments),\"text-html\":str(mail.text_html),\"text\":str(mail.text)}\n",
" #data={\"from\":str(mail.from_).replace('[','').replace(']','').replace('(','').replace(')',''),\"mailto\":str(mail.to).replace(']','').replace('[','').replace('(','').replace(')','')}\n",
" row=pd.Series(data,index=lcolumns)\n",
" ledf=ledf.append(pd.Series(row),ignore_index=True)\n",
" continue\n",
" except:\n",
" continue \n",
" else:\n",
" continue\n",
" if i > 10000000:\n",
" break\n",
" \n",
" #Turning ledf into the dataframe-list of emails:\n",
" email_df1=ledf[\"from\"]\n",
" email_df2=ledf[\"mailto\"]\n",
" \n",
" email_df=email_df1.append(email_df2)\n",
" \n",
" #deduplicate the edgelist\n",
" ledf.drop_duplicates(keep='first')\n",
" email_df.drop_duplicates(keep='first')\n",
" print(\"[*]Duplicates dropped.\")\n",
" #print(edf)\n",
" print(ledf.describe())\n",
" #print(ledf[\"subject\"],ledf[\"from\"])\n",
" #save the edgelists to a csv files\n",
" ledf_dir=directory+\"edf.csv\"\n",
" ledf.to_csv(ledf_dir, sep=',', encoding='utf-8')\n",
" ledf_dir2=directory+\"email_df.csv\"\n",
" email_df.to_csv(ledf_dir2,sep=',',index=False,header=False)\n",
" print(\"[*] CSV file saved, done!\")\n",
"\n",
"#------------------------------\n",
"\n",
"email_dir_miner(\"./emails/\")\n",
"#print(\"Output of email_miner_text(): \",\"\\n\",email_miner_text(raw_email))\n",
"#print(\"Output of email_miner_data(): \",\"\\n\",email_miner_data(raw_email))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.8 ('eolas-py3.8.8')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
},
"vscode": {
"interpreter": {
"hash": "7217508cf40a866d2c6d8c05c8a287a7af39b44bc942772df40ff0edc82b5da6"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}

342
topic-modelling.ipynb Normal file

@@ -0,0 +1,342 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1 - Topic modeling:\n",
"\n",
"This technique allows you for automated assessment of the text content and semantics. More reading: https://en.wikipedia.org/wiki/Topic_model\n",
"You may use this for large scale screening to determine oddities in website contents, emails, tweets, discussion forums or even social networks.\n",
"Each of the data sources requires specific dataminig approach. \n",
"\n",
"In this notebook, you will analyze data obtained from Twitter firehose api.\n",
"https://developer.twitter.com/en/docs/twitter-api/enterprise/compliance-firehose-api/overview\n",
"The advantage of working with this API is that you can request the access as a research or government body and get much more data, compared to privat API access (https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api.)\n",
"\n",
"However, you can use this Jupyter Notebook to process any texts you need.\n",
"Depending on the context of your data colelction, you are able to spot forst 3 phases of the disinformation killchain just based on visualization of the topics.\n",
"\n",
"That is:\n",
"Recon - When you see that certain topic suddenly resonates within the sampling space. When sampling is repeated to include the increments, there will be minor clusters around the initial structures.\n",
"Build - Clusters will be larger and new entities will appear to interact. \n",
"Seed - Similar cluster structures starts to appear in data from multiple sources.\n",
"\n",
"(Copy - Signifficant growth in cluster sizes and entity numbers per monitored info - space. Note that visibility of this phase depends on the method of sampling and may not occur if the sampling rate is too low)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use your own datsets if you modify the cell with the directory path below:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#----\n",
"directory=\"./dataset-kherson/kherson-11-2022/all-lang/\"\n",
"#----"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code below is set up for you so that you do not have to change anything. \n",
"Simply run each cell and see the output. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0%| | 0/73 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Consolidation of the files into one dataframe...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 73/73 [00:38<00:00, 1.91it/s]\n"
]
}
],
"source": [
"import json\n",
"import gzip\n",
"import pandas as pd\n",
"from tqdm import tqdm\n",
"import os\n",
"\n",
"df_ac=pd.DataFrame()\n",
"print(\"Consolidation of the files into one dataframe...\")\n",
"for fil in tqdm(os.listdir(directory)):\n",
" flen=len(os.listdir(directory))\n",
" file=fil\n",
" filename=str(str(directory)+(os.fsdecode(fil)))\n",
" if filename.endswith(\".gz\"):\n",
" try:\n",
" df_act = pd.read_json(filename,lines=True)\n",
" df_ac=df_ac.append(df_act)\n",
" except:\n",
" continue\n",
" continue\n",
" else:\n",
" continue\n",
"\n",
"print(\"Dataframe generated!\")\n",
"#print(\"Dataframe size is :\", df_ac.size)\n",
"#print(df_ac.head())\n",
"#print(df_ac.columns)\n",
"#print(df_ac[[\"lang\"]].head(5))\n",
"#print(df_ac[\"text\"].head(2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset you are working with has also information about the language of the tweet. It is often off, but can be used as a raw determination of \"language space\" of the tweet.\n",
"If you sort tweets but language, you can get insight to cultural / geopolitical context and also see the differences of the threat actor activity. \n",
"You can see for yourself - the code was prerun and you can see the differences in the languages and the results are stored in html files.\n",
"That means, you can open and view the files in any web browser.\n",
"\n",
"The dataset folder: \n",
"\n",
"\"/Documents/decode-cw/1-Topic_modeling/dataset-kherson/kherson-11-2022/all-lang/\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"lang\n",
"en 232189\n",
"fr 13072\n",
"es 11864\n",
"it 6394\n",
"de 6094\n",
"cy 3977\n",
"pl 3609\n",
"und 2749\n",
"cs 2135\n",
"nl 2113\n",
"pt 2004\n",
"uk 1555\n",
"fi 1394\n",
"tr 1261\n",
"qme 1212\n",
"dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Language statistics:\n",
"\n",
"df_ac[[\"lang\"]].value_counts().head(15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can change the desired language of the tweets, or you can leave the default \"en\" (as for \"English\") value in the code below:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"LANG=\"en\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"lang\n",
"en 232189\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Language filter:\n",
"df_ac=df_ac.loc[df_ac[\"lang\"]==LANG]\n",
"df_ac[[\"lang\"]].value_counts().head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Corpus saving to the disc: \n",
"df_ac['text'].to_csv(r'./dataset-kherson/kherson-11-2022/all-lang/corpora-en.txt', header=None, index=None, sep=' ', mode='a')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code below does the actual topic modeling. Further reading with more examples below:\n",
"\n",
"https://spacy.io/universe/project/bertopic;\n",
"https://datascience.stackexchange.com/questions/108178/how-to-prepare-texts-to-bert-roberta-models;\n",
"https://albertauyeung.github.io/2020/06/19/bert-tokenization.html/;\n",
"https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html;\n",
"https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-documents"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Timestamp('2022-11-09 00:58:38+0000', tz='UTC'), Timestamp('2022-11-09 00:14:54+0000', tz='UTC'), Timestamp('2022-11-11 00:11:39+0000', tz='UTC'), Timestamp('2022-11-11 00:13:17+0000', tz='UTC'), Timestamp('2022-11-11 00:13:53+0000', tz='UTC')]\n",
"['rt kherson was is and always will be ukraine godspeed to the afu', 'rt alex debrincat to drake batherson starts this game off quickly way to get on the board sens gosensgo', 'rt the armed forces completely surrounded kherson', 'rt ukrainian forces entering kherson right now', 'rt a short timelapse of the progress by towards kherson over the past few days the changes only show the verified advanc']\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Batches: 100%|██████████| 7256/7256 [01:57<00:00, 61.79it/s] \n",
"2022-12-01 16:29:05,240 - BERTopic - Transformed documents to Embeddings\n",
"2022-12-01 16:38:04,504 - BERTopic - Reduced dimensionality\n",
"2022-12-01 16:38:25,271 - BERTopic - Clustered reduced embeddings\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Topic modeling finished, go to the ./dataset-kherson/kherson-11-2022/all-lang/ to inspect the results in html formate.\n"
]
}
],
"source": [
"import re\n",
"import pandas as pd\n",
"from bertopic import BERTopic\n",
"\n",
"tweets=df_ac\n",
"\n",
"tweets.text = tweets.apply(lambda row: re.sub(r\"http\\S+\", \"\", row.text).lower(), 1)\n",
"tweets.text = tweets.apply(lambda row: \" \".join(filter(lambda x:x[0]!=\"@\", row.text.split())), 1)\n",
"tweets.text = tweets.apply(lambda row: \" \".join(re.sub(\"[^a-zA-Z]+\", \" \", row.text).split()), 1)\n",
"timestamps = tweets.created_at.to_list()\n",
"print(timestamps[:5])\n",
"tweets = tweets.text.to_list()\n",
"print(tweets[:5])\n",
"\n",
"# Create topics over time - only for the data with timestamps in standardised formate\n",
"model = BERTopic(verbose=True)\n",
"topics, probs = model.fit_transform(tweets)\n",
"#topics_over_time = model.topics_over_time(tweets, topics, timestamps)\n",
"#topics_over_time2 = model.topics_over_time(tweets, topics, timestamps, nr_bins=20)\n",
"#topics_over_time2 = model.topics_over_time(tweets, topics, timestamps)\n",
"\n",
"#Save the model:\n",
"#model.save(\"Bertopic_model\")\n",
"#Load the model:\n",
"#BERTopic.load(\"Bertopic_model\")\n",
"\n",
"#https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.visualize_topics\n",
"#figx=model.visualize_topics_over_time(topics_over_time, topics=[9, 10, 72, 83, 87, 91]) #selection of particular topics\n",
"#figx.show()\n",
"#figx.write_html(directory+\"topics_over_time.html\")\n",
"fig = model.visualize_topics()\n",
"#fig.show()\n",
"fig.write_html(directory+\"visualised_topics.html\")\n",
"fig2=model.visualize_barchart()\n",
"#fig2.show()\n",
"fig2.write_html(directory+\"visualise_barchart.html\")\n",
"fig3=model.visualize_hierarchy()\n",
"#fig3.show()\n",
"fig3.write_html(directory+\"visualise_hierarchy.html\")\n",
"#fig4=model.visualize_topics_over_time(topics_over_time2, top_n_topics=25)\n",
"#fig4.show()\n",
"#fig4.write_html(directory+\"visualize_topics_over_time.html\")\n",
"\n",
"print(\"Topic modeling finished, go to the\"+\" \"+ directory+\" \"+ \"to inspect the results in html formate.\")"
]
},
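{
"cell_type": "markdown",
"metadata": {},
"source": [
"The topics-over-time calls are left commented out in the cell above; the hedged sketch below shows how they could be enabled. It assumes that model, tweets, topics and timestamps from the previous cell are still in memory and that the installed BERTopic version accepts this call signature."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: topic evolution over time (see the commented-out lines in the previous cell).\n",
"topics_over_time = model.topics_over_time(tweets, topics, timestamps, nr_bins=20)\n",
"fig_t = model.visualize_topics_over_time(topics_over_time, top_n_topics=10)\n",
"fig_t.write_html(directory + \"visualize_topics_over_time.html\")"
]
},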
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The files are too complex to be loaded here, especialy in case of larger datasets.\n",
"You can open the outputs in the browser, go to this folder: \"/Documents/decode-cw/1-Topic_modeling/dataset-kherson/kherson-11-2022/all-lang\""
]
}
],
"metadata": {
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python 3.8.8 ('eolas-py3.8.8')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
},
"vscode": {
"interpreter": {
"hash": "7217508cf40a866d2c6d8c05c8a287a7af39b44bc942772df40ff0edc82b5da6"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}

1053
twitter-sentiment-analysis-Single-User.ipynb Normal file

File diff suppressed because one or more lines are too long