Mirror of
https://github.com/ssciwr/AMMICO.git
synced 2025-10-29 13:06:04 +02:00
update docs and google auth (#63)
* update notebooks and google auth
* update readme and text
* google cred
* update secret name
* add pandoc to CI
* pandoc step
* install pandoc
* correct typo
This commit is contained in:
parent 6eff8b8145
commit 0ca9366980
7 .github/workflows/docs.yml vendored
@@ -21,6 +21,13 @@ jobs:
        run: |
          pip install -e .
          python -m pip install -r requirements-dev.txt
      - name: set google auth
        uses: 'google-github-actions/auth@v0.4.0'
        with:
          credentials_json: '${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}'
      - name: get pandoc
        run: |
          sudo apt-get install -y pandoc
      - name: Build documentation
        run: |
          cd docs
10 README.md
@@ -65,8 +65,18 @@ The extracted text is then stored under the `text` key (column when exporting a

If you want to further analyse the text, you have to set the `analyse_text` keyword to `True`. In doing so, the text is processed using [spacy](https://spacy.io/) (tokenization, part-of-speech tagging, lemmatization, ...). The English text is cleaned of numbers and unrecognized words (`text_clean`), the spelling of the English text is corrected (`text_english_correct`), and sentiment and subjectivity analyses are carried out (`polarity`, `subjectivity`). The latter two steps use [TextBlob](https://textblob.readthedocs.io/en/dev/index.html). For more information on sentiment analysis with TextBlob, see [here](https://towardsdatascience.com/my-absolute-go-to-for-sentiment-analysis-textblob-3ac3a11d524).
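For orientation, the `polarity` and `subjectivity` scores mentioned above come from TextBlob; a minimal sketch of the underlying calls (not the package's own text module, and with a made-up example sentence) could look like this:

```python
# Minimal sketch: TextBlob spelling correction, sentiment and subjectivity.
# Illustrative only; this is not the AMMICO/misinformation implementation.
from textblob import TextBlob

blob = TextBlob("An examble sentence with a speling mistake.")
print(blob.correct())               # spelling-corrected text
print(blob.sentiment.polarity)      # polarity in [-1.0, 1.0]
print(blob.sentiment.subjectivity)  # subjectivity in [0.0, 1.0]
```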

### Content extraction

The image content ("caption") is extracted using the [LAVIS](https://github.com/salesforce/LAVIS) library. This library enables vision intelligence extraction using several state-of-the-art models, depending on the task. Further, it allows feature extraction from the images, where users can input textual and image queries, and the images in the database are matched to that query (multimodal search). Another option is question answering, where the user inputs a text question and the library finds the images that match the query.
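As a rough sketch of what the captioning step relies on, the LAVIS call below loads a BLIP captioning model and generates a caption for a single image. The model name, model type and file path are assumptions made for this example; it is not the package's own wrapper code.

```python
# Sketch of image captioning with LAVIS (BLIP); illustrative only.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# "blip_caption"/"base_coco" is one of several available models, assumed here.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)
raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical image file
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))  # list containing the generated caption
```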

### Emotion recognition

Emotion recognition is carried out using the [deepface](https://github.com/serengil/deepface) and [retinaface](https://github.com/serengil/retinaface) libraries. These libraries detect the presence of faces and predict their age, gender, emotion and race based on several state-of-the-art models. They also detect whether a person is wearing a face mask; if so, no further detection is carried out, as the mask prevents an accurate prediction.
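The underlying deepface call is roughly the following; the mask check and the dictionary output described above are not reproduced, and the file name is made up.

```python
# Sketch of face attribute analysis with deepface; illustrative only.
from deepface import DeepFace

result = DeepFace.analyze(
    img_path="example.jpg",                        # hypothetical image file
    actions=["age", "gender", "emotion", "race"],  # attributes described above
    detector_backend="retinaface",                 # face detection via retinaface
)
print(result)
```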

### Object detection

Object detection is carried out using [cvlib](https://github.com/arunponnusamy/cvlib) and the [YOLOv4](https://github.com/AlexeyAB/darknet) model. This library detects faces, people, and several inanimate objects; we currently have restricted the output to person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, cell phone.
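A minimal sketch of the underlying cvlib call is shown below. The `model="yolov4"` argument and the file name are assumptions for illustration, and the filtering to the classes listed above is omitted.

```python
# Sketch of object detection with cvlib; illustrative only.
import cv2
import cvlib as cv

img = cv2.imread("example.jpg")  # hypothetical image file
bbox, labels, conf = cv.detect_common_objects(img, confidence=0.5, model="yolov4")
print(labels)  # e.g. ["person", "car", "cell phone"]
```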

### Cropping of posts

Social media posts can automatically be cropped to remove further comments on the page and restrict the textual content to the first comment only.
@@ -12,7 +12,7 @@ sys.path.insert(0, os.path.abspath("../../misinformation/"))
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = "misinformation"
project = "AMMICO"
copyright = "2022, Scientific Software Center, Heidelberg University"
author = "Scientific Software Center, Heidelberg University"
release = "0.0.1"
@@ -3,8 +3,8 @@
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to misinformation's documentation!
==========================================
Welcome to AMMICO's documentation!
==================================

.. toctree::
   :maxdepth: 2
@@ -12,6 +12,10 @@ Welcome to misinformation's documentation!

   readme_link
   notebooks/Example faces
   notebooks/Example text
   notebooks/Example summary
   notebooks/Example multimodal
   notebooks/Example objects
   modules
   license_link
@@ -1,5 +1,5 @@
misinformation package modules
==============================
AMMICO package modules
======================

.. toctree::
   :maxdepth: 4
Binary data
docs/source/notebooks/3_blip_saved_features_image.pt
Binary file not shown.
@@ -40,8 +40,8 @@
   "outputs": [],
   "source": [
    "images = mutils.find_files(\n",
    "    path=\"../misinformation/test/data/\",\n",
    "    limit=1000,\n",
    "    path=\"data/\",\n",
    "    limit=10,\n",
    ")"
   ]
  },
@@ -51,16 +51,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
    "mydict = mutils.initialize_dict(images[0:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mydict"
    "mydict = mutils.initialize_dict(images)"
   ]
  },
  {
@@ -1,11 +1,12 @@
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "dcaa3da1",
   "metadata": {},
   "source": [
    "# Notebook for text extraction on image\n",
    "# Text extraction on image\n",
    "Inga Ulusoy, SSC, July 2022"
   ]
  },
@@ -44,9 +45,7 @@
    "import misinformation\n",
    "from misinformation import utils as mutils\n",
    "from misinformation import display as mdisplay\n",
    "import tensorflow as tf\n",
    "\n",
    "print(tf.config.list_physical_devices(\"GPU\"))"
    "import tensorflow as tf"
   ]
  },
  {
@@ -68,7 +67,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
    "images = mutils.find_files(path=\"../data/all/\", limit=1000)"
    "images = mutils.find_files(path=\"data\", limit=10)"
   ]
  },
  {
@@ -78,7 +77,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in images[0:3]:\n",
    "for i in images:\n",
    "    display(Image(filename=i))"
   ]
  },
@@ -89,30 +88,19 @@
   "metadata": {},
   "outputs": [],
   "source": [
    "mydict = mutils.initialize_dict(images[0:3])"
    "mydict = mutils.initialize_dict(images)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7b8b929f",
   "metadata": {},
   "source": [
    "# google cloud vision API\n",
    "## google cloud vision API\n",
    "First 1000 images per month are free."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\n",
    "    \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
    "] = \"../data/misinformation-campaign-981aa55a3b13.json\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0891b795-c7fe-454c-a45d-45fadf788142",
@@ -193,144 +181,6 @@
    "# Write the csv\n",
    "df.to_csv(\"./data_out.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bc8ac0a",
   "metadata": {},
   "source": [
    "# Topic analysis\n",
"The topic analysis is carried out using [BERTopic](https://maartengr.github.io/BERTopic/index.html) using an embedded model through a [spaCy](https://spacy.io/) pipeline."
|
||||
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4931941b",
   "metadata": {},
   "source": [
"BERTopic takes a list of strings as input. The more items in the list, the better for the topic modeling. If the below returns an error for `analyse_topic()`, the reason can be that your dataset is too small.\n",
|
||||
"### Option 1: Use the dictionary as obtained from the above analysis."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a3450a61",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
|
||||
"topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n",
|
||||
" mydict=mydict\n",
|
||||
").analyse_topic()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "95667342",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Option 2: Read in a csv\n",
|
||||
"Not to analyse too many images on google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
|
||||
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5530e436",
   "metadata": {},
   "outputs": [],
   "source": [
    "input_file_path = \"data_out.csv\"\n",
    "topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n",
    "    use_csv=True, csv_path=input_file_path\n",
    ").analyse_topic(return_topics=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b6ef6d7",
   "metadata": {},
   "source": [
    "### Access frequent topics\n",
"A topic of `-1` stands for an outlier and should be ignored. Topic count is the number of occurence of that topic. The output is structured from most frequent to least frequent topic."
|
||||
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(topic_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3316770",
   "metadata": {},
   "source": [
    "### Get information for specific topic\n",
"The most frequent topics can be accessed through `most_frequent_topics` with the most occuring topics first in the list."
|
||||
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db14fe03",
   "metadata": {},
   "outputs": [],
   "source": [
    "for topic in most_frequent_topics:\n",
    "    print(\"Topic:\", topic)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d10f701e",
   "metadata": {},
   "source": [
    "### Topic visualization\n",
    "The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2331afe6",
   "metadata": {},
   "outputs": [],
   "source": [
    "topic_model.visualize_topics()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4eaf353",
   "metadata": {},
   "source": [
    "### Save the model\n",
    "The model can be saved for future use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5e8377c",
   "metadata": {},
   "outputs": [],
   "source": [
    "topic_model.save(\"misinfo_posts\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c94edb9",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
@@ -30,6 +30,7 @@ dependencies = [
    "importlib_metadata",
    "ipython",
    "ipywidgets",
    "ipykernel",
    "matplotlib",
    "numpy<=1.23.4",
    "pandas",