* add bertopic to requirements

* basic topic modeling

* topic modeling using BERT; bugfix if no text on post

* update for google colab

* Catch connection errors

* replace newline character with space

* move topic analysis into PostprocessText class

* set up dataflow topic analysis

* expose topic model to UI

* tests for class init

* tests for topic analysis

* more tests

* take care of carriage return on windows

* take care of carriage return on windows

* take care of carriage return on windows

* set encoding to ensure windows compatibility

* track encoding error

* more debug

* skip topic analysis debug

* windows fixes
This commit is contained in:
Inga Ulusoy 2023-02-13 02:45:24 -08:00 committed by GitHub
parent 0a017b10ec
commit a6578cfdf3
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
8 changed files with 585 additions and 24 deletions

misinformation/test/data/test_data_out.csv (new file)

@@ -0,0 +1,52 @@
,filename,text,text_language,text_english
0,./test/data/IMG_3755.jpg,,,"Mathematische Formelsammlung
für Ingenieure und Naturwissenschaftler
Mit zahlreichen Abbildungen und Rechenbeispielen
und einer ausführlichen Integraltafel
3., verbesserte Auflage"
1,./test/data/IMG_3756.jpg,"SCATTERING THEORY
The Quantum Theory of
Nonrelativistic Collisions
JOHN R. TAYLOR
University of Colorado
ostaliga Lanbidean
1 ilde
ballenger stor goin
gdĐOL, SIVI 23 TL 02
de in obl
och yd badalang
a
Ber
ook Sy-RW enot go baldus",om,"SCATTERING THEORY
The Quantum Theory of
Nonrelativistic Collisions
JOHN R. TAYLOR
University of Colorado
ostaliga Lanbidean
1 ilde
balloons big goin
gdĐOL, SIVI 23 TL
there in obl
och yd change
a
Ber
ook Sy-RW isn't going anywhere"
2,./test/data/IMG_3757.jpg,"THE
ALGEBRAIC
EIGENVALUE
PROBLEM
DOM
NVS TIO
MINA
Monographs
on Numerical Analysis
J.. H. WILKINSON",en,"THE
ALGEBRAIC
EIGENVALUE
PROBLEM
DOM
NVS TIO
MINA
Monographs
on Numerical Analysis
J.. H. WILKINSON"

misinformation/test/data/test_data_out_nokey.csv (new file)

@@ -0,0 +1,52 @@
,filename,text,text_language,text_nglish
0,./test/data/IMG_3755.jpg,,,"Mathematische Formelsammlung
für Ingenieure und Naturwissenschaftler
Mit zahlreichen Abbildungen und Rechenbeispielen
und einer ausführlichen Integraltafel
3., verbesserte Auflage"
1,./test/data/IMG_3756.jpg,"SCATTERING THEORY
The Quantum Theory of
Nonrelativistic Collisions
JOHN R. TAYLOR
University of Colorado
ostaliga Lanbidean
1 ilde
ballenger stor goin
gdĐOL, SIVI 23 TL 02
de in obl
och yd badalang
a
Ber
ook Sy-RW enot go baldus",om,"SCATTERING THEORY
The Quantum Theory of
Nonrelativistic Collisions
JOHN R. TAYLOR
University of Colorado
ostaliga Lanbidean
1 ilde
balloons big goin
gdĐOL, SIVI 23 TL
there in obl
och yd change
a
Ber
ook Sy-RW isn't going anywhere"
2,./test/data/IMG_3757.jpg,"THE
ALGEBRAIC
EIGENVALUE
PROBLEM
DOM
NVS TIO
MINA
Monographs
on Numerical Analysis
J.. H. WILKINSON",en,"THE
ALGEBRAIC
EIGENVALUE
PROBLEM
DOM
NVS TIO
MINA
Monographs
on Numerical Analysis
J.. H. WILKINSON"

misinformation/test/data/topic_analysis_test.csv (new file)

@@ -0,0 +1,190 @@
text_english
Mercury: Retrograde
Pathology
Symbiote
ProductOfDrugs (Prod. The Virus and Antidote)
Venom
Gatteka
kamikaze (+ pulse)
T.R.U. (Totally Rotten Underground)
I Put My Dick in Your Mental
Andromeda
BRAINFOOD
Troll Under the Bridge
1000 Rounds
Sacrifice
Backpack
D(R)Own
"Okay
TakingOutTheTrash
Io sono qui
Paris
Murder
High 'N Mighty
Euronymous
Hades
Nails
Squeeze
No Teeth
Bang Ya Fucking Head
BLUE JUICE
Loch Ness
Hold Uh
Bone Saw
Coffin Wave
OhNo!
TheArtOfCremation
OakGroveRoad
WhatWasThat
FunnyToSeeYouHere
John Dee
Kybalion
Killer
608
Eternal Dreams
Nightmare Choir (I Been Asleep Too Long)
Exodus
Vengeance
Claustrophobia
Rearranged
Paralax
Exsanguination
Mutiny
Centipede
Грустная сука
This World Is Sick
Пламя
2:45
who is he
Sleeping
Timeless
Pound for Pound
Finger Trembling
Overload
Kill Yourself (Part III)
2nd Hand
Antarctica
Memoirs Of A Gorilla
Runnin' Thru The 7th With My Woadies
Mount Sinai
FUCKTHEPOPULATION
Magazine
2 Hot 4 U (feat. $Uicdeboy$)
O Pana!
LTE
Champion Of Death
Seppuku (feat. Suicideboy$ & Jgrxxn)
You're Now Tuning Into 66.6 FM With DJ Rapture (The Hottest Hour Of The Evening)
Slip On A Banana Clip
A Death In The Ocean Would Be Beautiful
Shattered Amethyst
Goosebumps
Venom
Bury Me
Hack Slash
2000 Rounds
Sea Sick
Grain
"Beware
Kali Yuga
Hexada
Caligula
Niagara (feat. Lil Peep)
Scrying Through Shattered Glass
Polaris
Rapture
Blackmage
Tartarus
Until the Light Takes Us
As Above so Look out Below
Swan
Sneak Diss (feat. So6ix)
Plague Doctor Mask
Some of Us May Never See the World
Filth
Homecoming
Blood
Sweat
Tears
Anabolic
HDMI
Dirt
Oxygen
Branches
CtrlAltDelete
BlastZone (ЗонаПоражения)
CharacterSelect (ВыборПерсонажа)
RestInPeace (Prod. by The Virus And Antidote)
BlackMold
Toxin
Electric
Cranium
Friday
Hooky
Kalaxian Crystals
Slurp
BROKE ft. Prohibeo
Lies
Terry McGinnis
Gremlin
Giant Squit
You Are Not Like Us
Arachnids
Give Ah Fuck
Death Wish
Allergies
Cut Throat
Memoirs of a Gorilla
Benz Truck (гелик)
Norf Norf
Dat $tick
"RAF (feat. A$AP Rocky
Crazy
Still Cold / Pathway Private
The Chills
Slip on a Banana Clip
Lights
Akina Speed Star
Big Fish
The Bodies Fall Just Like the Leaves
Story: No Title
P.S Fuck You Cunt (feat. Lil Peep)
Torch
"Buff Squad (feat. Pouya
Sarcophagus III (feat. $Uicideboy$)
Virginia Tech
Lte
Fuckthepopulation
Gloss of Blood
100K
Dark Light
"But Wait
Great Influence
It Don't Matter
absolute in doubt
Boss
Look at Me Now
Bulletproof
Contraband
Deira City Centre
Kyoto
Pull Out Game
Bird Is The Word
Life Is Short
Here We Go Again
Bloodshed
Wassup Bro!
ACT 2 - BirthOfTheSpaceGod
Grey Tee
Sleeping Bag
Afterlife
King Cobra (Drippin')
Heart Attack
Chain$Aw
"King
P.T.S.D
Brand New
Jukai
Philosopher's Throne
PRBLMS
Back At It

misinformation/test/test_text.py

@@ -2,6 +2,8 @@ import os
import pytest
import spacy
import misinformation.text as tt
import misinformation
import pandas as pd
TESTDICT = {
"IMG_3755": {
@@ -29,7 +31,6 @@ def test_TextDetector():
assert test_obj.subdict["text_language"] is None
assert test_obj.subdict["text_english"] is None
assert not test_obj.analyse_text
assert not test_obj.analyse_topic
@pytest.mark.gcv
@@ -39,7 +40,6 @@ def test_analyse_image():
test_obj.analyse_image()
test_obj = tt.TextDetector(TESTDICT[item], analyse_text=True)
test_obj.analyse_image()
test_obj = tt.TextDetector(TESTDICT[item], analyse_topic=True)
@pytest.mark.gcv
@@ -68,6 +68,15 @@ def test_translate_text():
assert test_obj.subdict["text_english"] == translated_text
def test_remove_linebreaks():
test_obj = tt.TextDetector({})
test_obj.subdict["text"] = "This is \n a test."
test_obj.subdict["text_english"] = "This is \n another\n test."
test_obj.remove_linebreaks()
assert test_obj.subdict["text"] == "This is a test."
assert test_obj.subdict["text_english"] == "This is another test."
def test_run_spacy():
test_obj = tt.TextDetector(TESTDICT["IMG_3755"], analyse_text=True)
ref_file = "./test/data/text_IMG_3755.txt"
@@ -106,3 +115,34 @@ def test_sentiment_analysis():
test_obj.sentiment_analysis()
assert test_obj.subdict["polarity"] == 0.5
assert test_obj.subdict["subjectivity"] == 0.6
def test_PostprocessText():
reference_dict = "THE\nALGEBRAIC\nEIGENVALUE\nPROBLEM\nDOM\nNVS TIO\nMINA\nMonographs\non Numerical Analysis\nJ.. H. WILKINSON"
reference_df = "Mathematische Formelsammlung\nfür Ingenieure und Naturwissenschaftler\nMit zahlreichen Abbildungen und Rechenbeispielen\nund einer ausführlichen Integraltafel\n3., verbesserte Auflage"
obj = tt.PostprocessText(mydict=TESTDICT)
# make sure test works on windows where end-of-line character is \r\n
test_dict = obj.list_text_english[2].replace("\r", "")
assert test_dict == reference_dict
for key in TESTDICT.keys():
TESTDICT[key].pop("text_english")
with pytest.raises(ValueError):
tt.PostprocessText(mydict=TESTDICT)
obj = tt.PostprocessText(use_csv=True, csv_path="./test/data/test_data_out.csv")
# make sure test works on windows where end-of-line character is \r\n
test_df = obj.list_text_english[0].replace("\r", "")
assert test_df == reference_df
with pytest.raises(ValueError):
tt.PostprocessText(use_csv=True, csv_path="./test/data/test_data_out_nokey.csv")
with pytest.raises(ValueError):
tt.PostprocessText()
def test_analyse_topic():
_, topic_df, most_frequent_topics = tt.PostprocessText(
use_csv=True, csv_path="./test/data/topic_analysis_test.csv"
).analyse_topic()
# since this is not deterministic we cannot be sure we get the same result twice
assert len(topic_df) == 2
assert topic_df["Name"].iloc[0] == "0_the_feat_of_is"
assert most_frequent_topics[0][0][0] == "the"
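For readers following the diff, the dictionary-validation step that `test_PostprocessText` exercises can be sketched standalone. This is a minimal illustration mirroring the behaviour of `PostprocessText.get_text_dict` (the function name here is hypothetical, not part of the library):

```python
def collect_text_english(mydict: dict) -> list:
    """Gather the 'text_english' entry from every item of a nested results dict.

    Mirrors the validation in PostprocessText.get_text_dict: a missing
    'text_english' key raises ValueError, matching the pytest.raises
    check in the test above.
    """
    texts = []
    for key, entry in mydict.items():
        if "text_english" not in entry:
            raise ValueError(
                "Please check your provided dictionary - no english text data found."
            )
        texts.append(entry["text_english"])
    return texts


# usage on a toy dict shaped like the TESTDICT entries
sample = {"IMG_1": {"text_english": "hello world"}, "IMG_2": {"text_english": "foo"}}
print(collect_text_english(sample))  # -> ['hello world', 'foo']
```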

misinformation/text.py

@@ -6,6 +6,9 @@ from textblob import TextBlob
from textblob import download_corpora
import io
from misinformation import utils
import grpc
import pandas as pd
from bertopic import BERTopic
# make widgets work again
# clean text has weird spaces and separation of "do n't"
@@ -13,14 +16,11 @@ from misinformation import utils
class TextDetector(utils.AnalysisMethod):
def __init__(
self, subdict: dict, analyse_text: bool = False, analyse_topic: bool = False
) -> None:
def __init__(self, subdict: dict, analyse_text: bool = False) -> None:
super().__init__(subdict)
self.subdict.update(self.set_keys())
self.translator = Translator()
self.analyse_text = analyse_text
self.analyse_topic = analyse_topic
if self.analyse_text:
self._initialize_spacy()
self._initialize_textblob()
@@ -46,13 +46,12 @@ class TextDetector(utils.AnalysisMethod):
def analyse_image(self):
self.get_text_from_image()
self.translate_text()
self.remove_linebreaks()
if self.analyse_text:
self._run_spacy()
self.clean_text()
self.correct_spelling()
self.sentiment_analysis()
if self.analyse_topic:
self.analyse_topic()
return self.subdict
def get_text_from_image(self):
@@ -62,12 +61,19 @@ class TextDetector(utils.AnalysisMethod):
with io.open(path, "rb") as image_file:
content = image_file.read()
image = vision.Image(content=content)
response = client.text_detection(image=image)
texts = response.text_annotations[0].description
# here check if text was found
if texts:
# check for usual connection errors and retry if necessary
try:
response = client.text_detection(image=image)
except grpc.RpcError as exc:
print("Cloud vision API connection failed")
print("Skipping this image ..{}".format(path))
print("Connection failed with code {}: {}".format(exc.code(), exc))
# here check if text was found on image
if response:
texts = response.text_annotations[0].description
self.subdict["text"] = texts
if response.error.message:
print("Google Cloud Vision Error")
raise ValueError(
"{}\nFor more info on error messages, check: "
"https://cloud.google.com/apis/design/errors".format(
@@ -80,6 +86,14 @@ class TextDetector(utils.AnalysisMethod):
self.subdict["text_language"] = translated.src
self.subdict["text_english"] = translated.text
def remove_linebreaks(self):
"""Remove linebreaks from original and translated text."""
if self.subdict["text"]:
self.subdict["text"] = self.subdict["text"].replace("\n", " ")
self.subdict["text_english"] = self.subdict["text_english"].replace(
"\n", " "
)
def _run_spacy(self):
"""Generate spacy doc object."""
self.doc = self.nlp(self.subdict["text_english"])
@@ -105,5 +119,73 @@ class TextDetector(utils.AnalysisMethod):
# where 0.0 is very objective and 1.0 is very subjective
self.subdict["subjectivity"] = self.doc._.blob.subjectivity
def analyse_topic(self):
pass
class PostprocessText:
def __init__(
self, mydict: dict = None, use_csv: bool = False, csv_path: str = None
) -> None:
self.use_csv = use_csv
if mydict:
print("Reading data from dict.")
self.mydict = mydict
self.list_text_english = self.get_text_dict()
elif self.use_csv:
print("Reading data from df.")
self.df = pd.read_csv(csv_path, encoding="utf8")
self.list_text_english = self.get_text_df()
else:
raise ValueError(
"Please provide either dictionary with textual data or \
a csv file by setting `use_csv` to True and providing a \
`csv_path`."
)
def analyse_topic(self, return_topics: int = 3):
"""Topic analysis using BERTopic."""
# load spacy pipeline
nlp = spacy.load(
"en_core_web_md",
exclude=["tagger", "parser", "ner", "attribute_ruler", "lemmatizer"],
)
try:
# unfortunately catching exceptions does not work here - need to figure out why
self.topic_model = BERTopic(embedding_model=nlp)
except TypeError:
print("BERTopic exited with an error - maybe your dataset is too small?")
self.topics, self.probs = self.topic_model.fit_transform(self.list_text_english)
# return the topic list
topic_df = self.topic_model.get_topic_info()
# return the most frequent return_topics
most_frequent_topics = []
if len(topic_df) < return_topics:
print("You requested more topics than are identified in your dataset -")
print(
"Returning only {} topics as these are all that have been found.".format(
len(topic_df)
)
)
for i in range(min(return_topics, len(topic_df))):
most_frequent_topics.append(self.topic_model.get_topic(i))
return self.topic_model, topic_df, most_frequent_topics
def get_text_dict(self):
# use dict to put text_english in list
list_text_english = []
for key in self.mydict.keys():
if "text_english" not in self.mydict[key]:
raise ValueError(
"Please check your provided dictionary - \
no english text data found."
)
list_text_english.append(self.mydict[key]["text_english"])
return list_text_english
def get_text_df(self):
# use csv file to obtain dataframe and put text_english in list
# check that "text_english" is there
if "text_english" not in self.df:
raise ValueError(
"Please check your provided dataframe - \
no english text data found."
)
return self.df["text_english"].tolist()
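The topic-capping behaviour in `analyse_topic` above (return at most `return_topics` topics, warning when fewer were found) can be illustrated with a small standalone sketch. The helper name is hypothetical; it mirrors the `min(return_topics, len(topic_df))` loop, not the library API:

```python
def cap_topics(topics: list, return_topics: int = 3) -> list:
    """Return at most `return_topics` entries, warning when the dataset
    yielded fewer topics than requested (as analyse_topic does with the
    rows of topic_df)."""
    if len(topics) < return_topics:
        print(
            "You requested more topics than are identified in your dataset - "
            "returning only {} topics as these are all that have been found.".format(
                len(topics)
            )
        )
    return topics[: min(return_topics, len(topics))]


# usage with two found topics but ten requested: all two are returned
print(cap_topics(["0_the_feat_of_is", "1_im_my_me_the"], return_topics=10))
```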

notebooks/get-text-from-image.ipynb (generated)

@@ -42,7 +42,18 @@
"import os\n",
"from IPython.display import Image, display\n",
"import misinformation\n",
"import tensorflow as tf\n",
"\n",
"print(tf.config.list_physical_devices(\"GPU\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27675810",
"metadata": {},
"outputs": [],
"source": [
"# download the models if they are not there yet\n",
"!python -m spacy download en_core_web_md\n",
"!python -m textblob.download_corpora"
@@ -55,9 +66,7 @@
"metadata": {},
"outputs": [],
"source": [
"images = misinformation.find_files(\n",
" path=\"drive/MyDrive/misinformation-data/\", limit=1000\n",
")"
"images = misinformation.find_files(path=\"../data/all/\", limit=1000)"
]
},
{
@@ -78,7 +87,7 @@
"metadata": {},
"outputs": [],
"source": [
"mydict = misinformation.utils.initialize_dict(images[0:10])"
"mydict = misinformation.utils.initialize_dict(images[0:3])"
]
},
{
@@ -99,7 +108,7 @@
"source": [
"os.environ[\n",
" \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
"] = \"drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\""
"] = \"../data/misinformation-campaign-981aa55a3b13.json\""
]
},
{
@@ -180,13 +189,143 @@
"outputs": [],
"source": [
"# Write the csv\n",
"df.to_csv(\"drive/MyDrive/misinformation-data/data_out.csv\")"
"df.to_csv(\"./data_out.csv\")"
]
},
{
"cell_type": "markdown",
"id": "4bc8ac0a",
"metadata": {},
"source": [
"# Topic analysis\n",
"The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) with an embedding model supplied through a [spaCy](https://spacy.io/) pipeline."
]
},
{
"cell_type": "markdown",
"id": "4931941b",
"metadata": {},
"source": [
"BERTopic takes a list of strings as input. The more items in the list, the better the topic modeling works. If `analyse_topic()` below returns an error, the reason may be that your dataset is too small.\n",
"### Option 1: Use the dictionary as obtained from the above analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "568537df",
"id": "a3450a61",
"metadata": {},
"outputs": [],
"source": [
"# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
"topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n",
" mydict=mydict\n",
").analyse_topic()"
]
},
{
"cell_type": "markdown",
"id": "95667342",
"metadata": {},
"source": [
"### Option 2: Read in a csv\n",
"To avoid re-analysing too many images on Google Cloud Vision, use the csv output to obtain the text when rerunning already analysed images."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5530e436",
"metadata": {},
"outputs": [],
"source": [
"input_file_path = \"data_out.csv\"\n",
"topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n",
" use_csv=True, csv_path=input_file_path\n",
").analyse_topic(return_topics=10)"
]
},
{
"cell_type": "markdown",
"id": "0b6ef6d7",
"metadata": {},
"source": [
"### Access frequent topics\n",
"A topic of `-1` stands for an outlier and should be ignored. The topic count is the number of occurrences of that topic. The output is ordered from most frequent to least frequent topic."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
"metadata": {},
"outputs": [],
"source": [
"print(topic_df)"
]
},
{
"cell_type": "markdown",
"id": "b3316770",
"metadata": {},
"source": [
"### Get information for specific topic\n",
"The most frequent topics can be accessed through `most_frequent_topics`, with the most frequently occurring topics first in the list."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db14fe03",
"metadata": {},
"outputs": [],
"source": [
"for topic in most_frequent_topics:\n",
" print(\"Topic:\", topic)"
]
},
{
"cell_type": "markdown",
"id": "d10f701e",
"metadata": {},
"source": [
"### Topic visualization\n",
"The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2331afe6",
"metadata": {},
"outputs": [],
"source": [
"topic_model.visualize_topics()"
]
},
{
"cell_type": "markdown",
"id": "f4eaf353",
"metadata": {},
"source": [
"### Save the model\n",
"The model can be saved for future use."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5e8377c",
"metadata": {},
"outputs": [],
"source": [
"topic_model.save(\"misinfo_posts\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c94edb9",
"metadata": {},
"outputs": [],
"source": []
@@ -194,7 +333,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "misinf",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -208,7 +347,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]"
"version": "3.10.6"
},
"vscode": {
"interpreter": {

pyproject.toml

@@ -45,6 +45,9 @@ dependencies = [
"jupyterlab",
"spacytextblob",
"textblob",
"bertopic",
"grpcio",
"pandas",
]
[project.scripts]

requirements.txt

@@ -22,4 +22,7 @@ https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.1
jupyterlab
spacytextblob
textblob
git+https://github.com/sloria/TextBlob.git@dev
git+https://github.com/sloria/TextBlob.git@dev
bertopic
grpcio
pandas