* update notebooks and google auth

* update readme and text

* google cred

* update secret name

* add pandoc to CI

* pandoc step

* install pandoc

* correct typo
This commit is contained in:
Inga Ulusoy 2023-03-23 06:07:30 -07:00 committed by GitHub
parent 6eff8b8145
commit 0ca9366980
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
9 changed files with 38 additions and 175 deletions

.github/workflows/docs.yml (vendored)
View file

@@ -21,6 +21,13 @@ jobs:
run: |
pip install -e .
python -m pip install -r requirements-dev.txt
- name: set google auth
uses: 'google-github-actions/auth@v0.4.0'
with:
credentials_json: '${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}'
- name: get pandoc
run: |
sudo apt-get install -y pandoc
- name: Build documentation
run: |
cd docs

View file

@@ -65,8 +65,18 @@ The extracted text is then stored under the `text` key (column when exporting a
If you want to analyse the text further, set the `analyse_text` keyword to `True`. The text is then processed using [spacy](https://spacy.io/) (tokenized, part-of-speech, lemma, ...). The English text is cleaned of numbers and unrecognized words (`text_clean`), the spelling of the English text is corrected (`text_english_correct`), and sentiment and subjectivity analyses are carried out (`polarity`, `subjectivity`). The latter two steps are carried out using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html). For more information on the sentiment analysis using TextBlob see [here](https://towardsdatascience.com/my-absolute-go-to-for-sentiment-analysis-textblob-3ac3a11d524).
### Content extraction
The image content ("caption") is extracted using the [LAVIS](https://github.com/salesforce/LAVIS) library. This library enables vision intelligence extraction using several state-of-the-art models, depending on the task. Further, it allows feature extraction from the images, where users can input textual and image queries, and the images in the database are matched to that query (multimodal search). Another option is question answering, where the user inputs a text question and the library finds the images that match the query.
### Emotion recognition
Emotion recognition is carried out using the [deepface](https://github.com/serengil/deepface) and [retinaface](https://github.com/serengil/retinaface) libraries. These libraries detect the presence of faces and predict their age, gender, emotion, and race using several state-of-the-art models. They also detect whether the person is wearing a face mask; if so, no further detection is carried out, as the mask prevents an accurate prediction.
### Object detection
Object detection is carried out using [cvlib](https://github.com/arunponnusamy/cvlib) and the [YOLOv4](https://github.com/AlexeyAB/darknet) model. These tools detect faces, people, and several inanimate objects; we currently restrict the output to: person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, and cell phone.
### Cropping of posts
Social media posts can automatically be cropped to remove further comments on the page and restrict the textual content to the first comment only.

View file

@@ -12,7 +12,7 @@ sys.path.insert(0, os.path.abspath("../../misinformation/"))
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = "misinformation"
project = "AMMICO"
copyright = "2022, Scientific Software Center, Heidelberg University"
author = "Scientific Software Center, Heidelberg University"
release = "0.0.1"

View file

@@ -3,8 +3,8 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to misinformation's documentation!
==========================================
Welcome to AMMICO's documentation!
==================================
.. toctree::
:maxdepth: 2
@@ -12,6 +12,10 @@ Welcome to misinformation's documentation!
readme_link
notebooks/Example faces
notebooks/Example text
notebooks/Example summary
notebooks/Example multimodal
notebooks/Example objects
modules
license_link

View file

@@ -1,5 +1,5 @@
misinformation package modules
==============================
AMMICO package modules
======================
.. toctree::
:maxdepth: 4

Binary file not shown.

View file

@@ -40,8 +40,8 @@
"outputs": [],
"source": [
"images = mutils.find_files(\n",
" path=\"../misinformation/test/data/\",\n",
" limit=1000,\n",
" path=\"data/\",\n",
" limit=10,\n",
")"
]
},
@@ -51,16 +51,7 @@
"metadata": {},
"outputs": [],
"source": [
"mydict = mutils.initialize_dict(images[0:10])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mydict"
"mydict = mutils.initialize_dict(images)"
]
},
{

View file

@@ -1,11 +1,12 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "dcaa3da1",
"metadata": {},
"source": [
"# Notebook for text extraction on image\n",
"# Text extraction on image\n",
"Inga Ulusoy, SSC, July 2022"
]
},
@@ -44,9 +45,7 @@
"import misinformation\n",
"from misinformation import utils as mutils\n",
"from misinformation import display as mdisplay\n",
"import tensorflow as tf\n",
"\n",
"print(tf.config.list_physical_devices(\"GPU\"))"
"import tensorflow as tf"
]
},
{
@@ -68,7 +67,7 @@
"metadata": {},
"outputs": [],
"source": [
"images = mutils.find_files(path=\"../data/all/\", limit=1000)"
"images = mutils.find_files(path=\"data\", limit=10)"
]
},
{
@@ -78,7 +77,7 @@
"metadata": {},
"outputs": [],
"source": [
"for i in images[0:3]:\n",
"for i in images:\n",
" display(Image(filename=i))"
]
},
@@ -89,30 +88,19 @@
"metadata": {},
"outputs": [],
"source": [
"mydict = mutils.initialize_dict(images[0:3])"
"mydict = mutils.initialize_dict(images)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7b8b929f",
"metadata": {},
"source": [
"# google cloud vision API\n",
"## google cloud vision API\n",
"The first 1000 images per month are free."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986",
"metadata": {},
"outputs": [],
"source": [
"os.environ[\n",
" \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
"] = \"../data/misinformation-campaign-981aa55a3b13.json\""
]
},
{
"cell_type": "markdown",
"id": "0891b795-c7fe-454c-a45d-45fadf788142",
@@ -193,144 +181,6 @@
"# Write the csv\n",
"df.to_csv(\"./data_out.csv\")"
]
},
{
"cell_type": "markdown",
"id": "4bc8ac0a",
"metadata": {},
"source": [
"# Topic analysis\n",
"The topic analysis is carried out with [BERTopic](https://maartengr.github.io/BERTopic/index.html), using an embedded model through a [spaCy](https://spacy.io/) pipeline."
]
},
{
"cell_type": "markdown",
"id": "4931941b",
"metadata": {},
"source": [
"BERTopic takes a list of strings as input. The more items in the list, the better the topic modeling works. If `analyse_topic()` below returns an error, your dataset may be too small.\n",
"### Option 1: Use the dictionary as obtained from the above analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3450a61",
"metadata": {},
"outputs": [],
"source": [
"# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
"topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n",
" mydict=mydict\n",
").analyse_topic()"
]
},
{
"cell_type": "markdown",
"id": "95667342",
"metadata": {},
"source": [
"### Option 2: Read in a csv\n",
"To avoid analysing too many images on Google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5530e436",
"metadata": {},
"outputs": [],
"source": [
"input_file_path = \"data_out.csv\"\n",
"topic_model, topic_df, most_frequent_topics = misinformation.text.PostprocessText(\n",
" use_csv=True, csv_path=input_file_path\n",
").analyse_topic(return_topics=10)"
]
},
{
"cell_type": "markdown",
"id": "0b6ef6d7",
"metadata": {},
"source": [
"### Access frequent topics\n",
"A topic of `-1` stands for an outlier and should be ignored. The topic count is the number of occurrences of that topic. The output is ordered from most frequent to least frequent topic."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
"metadata": {},
"outputs": [],
"source": [
"print(topic_df)"
]
},
{
"cell_type": "markdown",
"id": "b3316770",
"metadata": {},
"source": [
"### Get information for specific topic\n",
"The most frequent topics can be accessed through `most_frequent_topics`, with the most frequently occurring topics first in the list."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db14fe03",
"metadata": {},
"outputs": [],
"source": [
"for topic in most_frequent_topics:\n",
" print(\"Topic:\", topic)"
]
},
{
"cell_type": "markdown",
"id": "d10f701e",
"metadata": {},
"source": [
"### Topic visualization\n",
"The topics can also be visualized. Careful: This only works if there is sufficient data (quantity and quality)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2331afe6",
"metadata": {},
"outputs": [],
"source": [
"topic_model.visualize_topics()"
]
},
{
"cell_type": "markdown",
"id": "f4eaf353",
"metadata": {},
"source": [
"### Save the model\n",
"The model can be saved for future use."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5e8377c",
"metadata": {},
"outputs": [],
"source": [
"topic_model.save(\"misinfo_posts\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c94edb9",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

View file

@@ -30,6 +30,7 @@ dependencies = [
"importlib_metadata",
"ipython",
"ipywidgets",
"ipykernel",
"matplotlib",
"numpy<=1.23.4",
"pandas",