{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b600b06a",
   "metadata": {},
   "source": [
    "# Apply D3lta to a generated dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e9349ad-d702-4cee-8d04-ead6eed903fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from d3lta.faissd3lta import semantic_faiss\n",
    "\n",
    "# Show the full text of each document instead of truncating long strings.\n",
    "pd.set_option(\"display.max_colwidth\", None)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fb20f6d",
   "metadata": {},
   "source": [
    "## Synthetic dataset\n",
    "\n",
    "The dataset was generated with gpt-3.5-turbo and DeepL.\n",
    "\n",
    "Each document is a text stored in the `original` column of the dataset.\n",
    "- Rewording and copypasta documents were generated from a specific `prompt`.\n",
    "- Translation documents have no prompt.\n",
    "- Documents used to create variations (rewording, copypasta, translation) are seeds (`seed` set to `True`). For simplicity, the texts were generated so that each one can only be a single duplicate type.\n",
    "- Documents come from different sources, given by `text_type`: books, tweets, news.\n",
    "\n",
    "The `language` of each text is also given."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7017280b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-13T10:14:42.661159Z",
     "iopub.status.busy": "2024-09-13T10:14:42.660425Z",
     "iopub.status.idle": "2024-09-13T10:14:42.683536Z",
     "shell.execute_reply": "2024-09-13T10:14:42.682901Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>original</th>\n",
       "      <th>text_type</th>\n",
       "      <th>language</th>\n",
       "      <th>prompt</th>\n",
       "      <th>seed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>doc_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Voici que j'achève, avec ce roman, les cinq ouvrages qui m'ont été envoyés dans le cadre d'une opération Masse Critique privilégiée et c'est donc tout naturellement que je commence cette critique en remerciant babelio ainsi que les éditions Kennes car je suis vraiment contente d'avoir découvert leur nouvelle collections K, même si si je reste sur une mauvaise impression avec cette dernière lecture.</td>\n",
       "      <td>books</td>\n",
       "      <td>fr</td>\n",
       "      <td>NaN</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>With this novel, I've completed the five books sent to me as part of a special Critical Mass campaign, so it's only natural that I should start this review by thanking babelio and Kennes, because I'm really pleased to have discovered their new K series, even if I'm still left with a bad impression from this latest read.</td>\n",
       "      <td>books</td>\n",
       "      <td>en</td>\n",
       "      <td>NaN</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Mit diesem Roman schließe ich die fünf Bücher ab, die mir im Rahmen einer Aktion \"Masse Critique\" zugeschickt wurden, und so ist es nur natürlich, dass ich diese Rezension mit einem Dank an babelio und den Kennes-Verlag beginne, denn ich bin wirklich froh, ihre neue K-Kollektion entdeckt zu haben, auch wenn ich bei der letzten Lektüre einen schlechten Eindruck hatte.</td>\n",
       "      <td>books</td>\n",
       "      <td>de</td>\n",
       "      <td>NaN</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Com este romance, completei os cinco livros que me foram enviados no âmbito de uma campanha especial da Massa Crítica, por isso é natural que comece esta recensão agradecendo ao babelio e ao Kennes, porque estou muito contente por ter descoberto a sua nova série K, mesmo que ainda tenha ficado com uma má impressão desta última leitura.</td>\n",
       "      <td>books</td>\n",
       "      <td>pt</td>\n",
       "      <td>NaN</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>有了这本小说,我就完成了作为 \"临界质量 \"特别活动的一部分寄给我的五本书,因此,在这篇评论的开头,我自然要感谢 babelio 和 Kennes,因为我真的很高兴发现了他们的新 K 系列,尽管最近这本书给我留下了不好的印象。</td>\n",
       "      <td>books</td>\n",
       "      <td>zh</td>\n",
       "      <td>NaN</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>source_target</th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>original_source</th>\n",
       "      <th>original_target</th>\n",
       "      <th>language_source</th>\n",
       "      <th>language_target</th>\n",
       "      <th>true_label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10-11</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "      <td>Voici que j'achève, avec ce roman, les cinq ouvrages qui m'ont été envoyés dans le cadre d'une opération Masse Critique privilégiée et c'est donc tout naturellement que je commence cette critique en remerciant babelio ainsi que les éditions Kennes car je suis vraiment contente d'avoir découvert leur nouvelle collections K, même si si je reste sur une mauvaise impression avec cette dernière lecture.</td>\n",
       "      <td>With this novel, I've completed the five books sent to me as part of a special Critical Mass campaign, so it's only natural that I should start this review by thanking babelio and Kennes, because I'm really pleased to have discovered their new K series, even if I'm still left with a bad impression from this latest read.</td>\n",
       "      <td>fr</td>\n",
       "      <td>en</td>\n",
       "      <td>translation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10-12</td>\n",
       "      <td>10</td>\n",
       "      <td>12</td>\n",
       "      <td>Voici que j'achève, avec ce roman, les cinq ouvrages qui m'ont été envoyés dans le cadre d'une opération Masse Critique privilégiée et c'est donc tout naturellement que je commence cette critique en remerciant babelio ainsi que les éditions Kennes car je suis vraiment contente d'avoir découvert leur nouvelle collections K, même si si je reste sur une mauvaise impression avec cette dernière lecture.</td>\n",
       "      <td>Mit diesem Roman schließe ich die fünf Bücher ab, die mir im Rahmen einer Aktion \"Masse Critique\" zugeschickt wurden, und so ist es nur natürlich, dass ich diese Rezension mit einem Dank an babelio und den Kennes-Verlag beginne, denn ich bin wirklich froh, ihre neue K-Kollektion entdeckt zu haben, auch wenn ich bei der letzten Lektüre einen schlechten Eindruck hatte.</td>\n",
       "      <td>fr</td>\n",
       "      <td>de</td>\n",
       "      <td>translation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10-13</td>\n",
       "      <td>10</td>\n",
       "      <td>13</td>\n",
       "      <td>Voici que j'achève, avec ce roman, les cinq ouvrages qui m'ont été envoyés dans le cadre d'une opération Masse Critique privilégiée et c'est donc tout naturellement que je commence cette critique en remerciant babelio ainsi que les éditions Kennes car je suis vraiment contente d'avoir découvert leur nouvelle collections K, même si si je reste sur une mauvaise impression avec cette dernière lecture.</td>\n",
       "      <td>Com este romance, completei os cinco livros que me foram enviados no âmbito de uma campanha especial da Massa Crítica, por isso é natural que comece esta recensão agradecendo ao babelio e ao Kennes, porque estou muito contente por ter descoberto a sua nova série K, mesmo que ainda tenha ficado com uma má impressão desta última leitura.</td>\n",
       "      <td>fr</td>\n",
       "      <td>pt</td>\n",
       "      <td>translation</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th>dup_type</th>\n",
       "      <th>copy-pasta</th>\n",
       "      <th>rewording</th>\n",
       "      <th>translation</th>\n",
       "      <th>nomatch</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>true_label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>copypasta</th>\n",
       "      <td>3871</td>\n",
       "      <td>94</td>\n",
       "      <td>0</td>\n",
       "      <td>65</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rewording</th>\n",
       "      <td>217</td>\n",
       "      <td>3147</td>\n",
       "      <td>0</td>\n",
       "      <td>653</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>translation</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2708</td>\n",
       "      <td>1792</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nomatch</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1485000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "