UNICRI-Decode-CW/2-Named-Entity-Recognition.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2 - Named Entity Recognition:\n",
    "\n",
    "This is just a raw example, how the entity recogniton works.\n",
    "It uses the pre-trained language based models, based on news, wikipedia or website data. The initial creation of such model is expensive on computational resources. \n",
    "See this lik for more details: https://spacy.io/models\n",
    "Note that this techique is language sensitive and you have to know in advance, what is the language of your dataset. Our training example is in English.\n",
    "\n",
    "This is the link to Ukrainian model that you may use from the beginning and add additional precision by trianing it on the field data: https://spacy.io/models/uk\n",
    "\n",
    "More reading: https://towardsdatascience.com/named-entity-recognition-ner-using-spacy-nlp-part-4-28da2ece57c6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, set up the proper directory:\n",
    "(You don't have to change the line below if you want to process the default dataset.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "#----\n",
    "directory=\"./dataset-zelenskywarcriminal/all-lang/\"\n",
    "#----"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3/3 [00:00<00:00, 74.57it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Consolidation of the files into one dataframe...\n",
      "Dataframe generated!\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "import gzip\n",
    "import pandas as pd\n",
    "from tqdm import tqdm\n",
    "import os\n",
    "\n",
    "\n",
    "df_ac=pd.DataFrame()\n",
    "print(\"Consolidation of the files into one dataframe...\")\n",
    "\n",
    "for fil in tqdm(os.listdir(directory)):\n",
    "    flen=len(os.listdir(directory))\n",
    "    file=fil\n",
    "    #print(file)\n",
    "    filename=str(str(directory)+(os.fsdecode(fil)))\n",
    "    if filename.endswith(\".gz\"):\n",
    "        try:\n",
    "            df_act = pd.read_json(filename,lines=True)\n",
    "            df_ac=df_ac.append(df_act)\n",
    "        except:\n",
    "            continue\n",
    "        \n",
    "        continue\n",
    "    else:\n",
    "        continue\n",
    "\n",
    "print(\"Dataframe generated!\")\n",
    "#print(df_ac.head())\n",
    "#print(df_ac.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the counts of languages, present in the dataset:\n",
    "\n",
    "For next processing, we chose to work with English only.\n",
    "In the field you will need to use multiple language models, or good translator. In case of the latter, I recommend the Deepl AI.\n",
    "https://www.deepl.com/en/docs-api Compared to Google Translate (as of 5-12-2022) Deepl seems to have better understanding for semantics and provides good API to work with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "lang\n",
       "en      102\n",
       "fr       45\n",
       "it       16\n",
       "es       12\n",
       "qme      10\n",
       "dtype: int64"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#print(df_ac[[\"lang\"]].head(5))\n",
    "#print(df_ac[\"text\"].head(2))\n",
    "df_ac[[\"lang\"]].value_counts().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Choose the language if interest, or keep the current preset for English:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "LANG='en'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "lang\n",
       "en      102\n",
       "dtype: int64"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Filter only the language with available model in spacy\n",
    "df_ac=df_ac.loc[df_ac[\"lang\"]==LANG]\n",
    "df_ac[[\"lang\"]].value_counts().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Codename for the English language model: \"en_core_web_sm\"\n",
    " (See this link to find more models: https://spacy.io/models)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "LANG_MODEL_CODE='en_core_web_sm'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the code below to extract the Named entities, sentiment and keywords from the text:\n",
    "\n",
    "The output is both CSV and \"pickled\" dataframe.  Since the tweets contains interpunction, that breaks the usual CSV formatting, the pickled output is also inlcuded. You can use this \"pickled\" file as an input for further analysis and the formatting will work.\n",
    "Both output files will be saved to the same directory, that keeps this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded slice of dataframe with text.\n",
      "Loaded keywords generated from the text column.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<ipython-input-63-2d83fc9cbbb3>:44: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  df_ac1['text_keywords']=df_ac1['text'].apply(text_keywords_gen)\n",
      "<ipython-input-63-2d83fc9cbbb3>:46: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  df_ac1['sentiment']=df_ac1['text'].apply(sentiment_detector)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sentiment detection finished.\n",
      "NER - Named Entity Recognition finished.\n",
      "Done!\n",
      "Saving df to pickle..\n",
      "Df pickled!\n",
      "Saving pickled DF to csv...\n",
      "df_ac saved to csv!\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<ipython-input-63-2d83fc9cbbb3>:48: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  df_ac1['ner_detection']=df_ac1['text'].apply(ner_extract)\n"
     ]
    }
   ],
   "source": [
    "def text_keywords_gen(user_input): \n",
    "    import requests\n",
    "    import bs4\n",
    "    from nltk.tokenize import sent_tokenize\n",
    "    from gensim.summarization import summarize\n",
    "    from gensim.summarization import keywords\n",
    "    #text= input_texter(user_input)\n",
    "    text=user_input\n",
    "    #print(\"\\nKeywords: \")\n",
    "    #print(keywords(user_input).split(\"\\n\"))\n",
    "    kwrds=keywords(user_input).split(\"\\n\")\n",
    "    return kwrds\n",
    "\n",
    "def sentiment_detector(span):\n",
    "    import spacy\n",
    "    from spacytextblob.spacytextblob import SpacyTextBlob\n",
    "    nlp=spacy.load(LANG_MODEL_CODE)\n",
    "    nlp.add_pipe('spacytextblob')\n",
    "    doc=nlp(span)\n",
    "    sentiment={\"polarity\":doc._.polarity,\"subjectivity\":doc._.subjectivity,\"assessments\":doc._.assessments}\n",
    "    #print(str(sentiment))\n",
    "    return sentiment\n",
    "\n",
    "def ner_extract(sentence):\n",
    "    #Named entity recognition in a span / sentence.\n",
    "    import spacy\n",
    "    nlp=spacy.load(\"en_core_web_lg\")\n",
    "    doc=nlp(sentence)\n",
    "    entities=[]\n",
    "    for ent in doc.ents:\n",
    "        #print(ent.text, ent.start_char, ent.end_char, ent.label_)\n",
    "        entities.append([ent.text, ent.start_char, ent.end_char, ent.label_])\n",
    "    return entities\n",
    "\n",
    "def pickled_df_to_csv(infile_pickle,outfile_csv):\n",
    "    import pandas as pd\n",
    "    df=pd.read_pickle(infile_pickle)\n",
    "    df.to_csv(outfile_csv)\n",
    "    #print(\"CSV file saved to:\" +\" \"+outfile_csv)\n",
    "\n",
    "df_ac1=df_ac[['id','text',\"lang\"]]\n",
    "print(\"Loaded slice of dataframe with text.\")\n",
    "df_ac[\"lang_model\"]=df_ac[\"lang\"].apply(spacy_lang_assign)\n",
    "df_ac1['text_keywords']=df_ac1['text'].apply(text_keywords_gen)\n",
    "print(\"Loaded keywords generated from the text column.\")\n",
    "df_ac1['sentiment']=df_ac1['text'].apply(sentiment_detector)\n",
    "print('Sentiment detection finished.')\n",
    "df_ac1['ner_detection']=df_ac1['text'].apply(ner_extract)\n",
    "print(\"NER - Named Entity Recognition finished.\")\n",
    "print(\"Done!\")\n",
    "#print(df_ac1.head(3))\n",
    "\n",
    "print(\"Saving df to pickle..\")\n",
    "infilef=directory+\"/pickled-twitter-dump-df-keywords\"\n",
    "infilecsv=directory+\"/df_ac-in-csv.csv\"\n",
    "df_ac.to_pickle(infilef,compression=\"infer\",protocol=4)\n",
    "print(\"Df pickled!\")\n",
    "print(\"Saving pickled DF to csv...\")\n",
    "pickled_df_to_csv(infilef,infilecsv)\n",
    "print(\"df_ac saved to csv!\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: In case of trouble, run also code below. It may occur on some systems, that you need to fix the environment in order to run such comptutational heavy operations like Named entity recognition.\n",
    "In provided Virtual Machine the error should be fixed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See the top 10 results for quick check:\n",
    "\n",
    "(For more detalis on entity types, reffer to this link: https://spacy.io/usage/linguistic-features#named-entities )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "unhashable type: 'list'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.map_locations\u001b[0;34m()\u001b[0m\n",
      "\u001b[0;31mTypeError\u001b[0m: unhashable type: 'list'"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'\n",
      "Traceback (most recent call last):\n",
      "  File \"pandas/_libs/hashtable_class_helper.pxi\", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations\n",
      "TypeError: unhashable type: 'list'\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[american, 57, 65, NORP]                  3\n",
       "[RT @TimBronson5:, 0, 16, ORG]            3\n",
       "[RT @NoMoreNATO:, 0, 15, ORG]             3\n",
       "[RT @Stop_This_Evil, 0, 18, ORG]          2\n",
       "[Modi, 22, 26, PERSON]                    2\n",
       "[Nazis, 60, 65, NORP]                     2\n",
       "[@vonderleyen, 0, 12, ORG]                2\n",
       "[ZelenskyWarCriminal, 19, 38, NORP]       2\n",
       "[ZelenskyWarCriminal, 1, 20, ORG]         2\n",
       "[RT, 0, 2, ORG]                           2\n",
       "[RT @NTY57NTY, 0, 12, ORG]                2\n",
       "[@GRDecter, 0, 9, PERSON]                 1\n",
       "[Ukrainian, 22, 31, NORP]                 1\n",
       "[Jewish, 53, 59, NORP]                    1\n",
       "[RT @ianbremmer, 0, 14, ORG]              1\n",
       "[@ZelenskyyUa @RishiSunak, 0, 24, ORG]    1\n",
       "[RT @Rich_Gally_:, 0, 16, ORG]            1\n",
       "[Nigel, 19, 24, PERSON]                   1\n",
       "[millions, 100, 108, CARDINAL]            1\n",
       "[@ZelenskyyUa Be, 14, 29, ORG]            1\n",
       "Name: ner_detection, dtype: int64"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ac1['ner_detection'].str[0].value_counts().head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keywords results:\n",
    "\n",
    "There are several ways of how to determine what is the keyword in particular text. This method is based on library called \"gensim\" and one of the advanatges is that it is language agnostic.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "                       29\n",
       "zelenskywarcriminal     6\n",
       "zelensky                4\n",
       "times                   3\n",
       "western                 2\n",
       "eur                     2\n",
       "Name: text_keywords, dtype: int64"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ac1['text_keywords'].str[0].value_counts().head(6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sentiment analysis:\n",
    "\n",
    "The method uses combination of rules and pre-trained ML model to determine sentiment from so called \"assessment words\". Since the tweets do not always contain such words, sometimes it is not possible to detect the sentiment properly and the value remains empty.\n",
    "Mode details here: https://spacy.io/universe/project/spacy-textblob\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Series([], Name: sentiment, dtype: int64)"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ac1['sentiment'].str[0].value_counts().head(20)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "vscode": {
   "interpreter": {
    "hash": "7217508cf40a866d2c6d8c05c8a287a7af39b44bc942772df40ff0edc82b5da6"
   }
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 1,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}