# Apply D3lta to a generated dataset

In [None]:
import pandas as pd
from d3lta.faissd3lta import semantic_faiss
pd.set_option("max_colwidth", None)

## Synthetic dataset

The dataset has been generated with the help of gpt-3.5-turbo and DeepL. 

Each document is a text stored in the `original` column of the dataset.
- Documents for rewording and copypasta were generated from a specific `prompt`.
- Documents for translation do not have a prompt.
- Documents used to create variations (rewording, copypasta, translation) are seeds (`seed` set to `True`). For simplicity, the texts were generated so that each one can only belong to a single duplicate type.
- Documents are derived from different `text_type` values: books, tweets, news.

The `language` of each text is given.

In [2]:
df_synth = pd.read_csv('../data/synthetic_dataset_documents.csv')
df_synth = df_synth.assign(doc_id=df_synth['doc_id'].astype(str)).set_index(["doc_id"])
df_synth.head(5)

doc_id,original,text_type,language,prompt,seed
10,"Voici que j'achève, avec ce roman, les cinq ouvrages qui m'ont été envoyés dans le cadre d'une opération Masse Critique privilégiée et c'est donc tout naturellement que je commence cette critique en remerciant babelio ainsi que les éditions Kennes car je suis vraiment contente d'avoir découvert leur nouvelle collections K, même si si je reste sur une mauvaise impression avec cette dernière lecture.",books,fr,,True
11,"With this novel, I've completed the five books sent to me as part of a special Critical Mass campaign, so it's only natural that I should start this review by thanking babelio and Kennes, because I'm really pleased to have discovered their new K series, even if I'm still left with a bad impression from this latest read.",books,en,,False
12,"Mit diesem Roman schließe ich die fünf Bücher ab, die mir im Rahmen einer Aktion ""Masse Critique"" zugeschickt wurden, und so ist es nur natürlich, dass ich diese Rezension mit einem Dank an babelio und den Kennes-Verlag beginne, denn ich bin wirklich froh, ihre neue K-Kollektion entdeckt zu haben, auch wenn ich bei der letzten Lektüre einen schlechten Eindruck hatte.",books,de,,False
13,"Com este romance, completei os cinco livros que me foram enviados no âmbito de uma campanha especial da Massa Crítica, por isso é natural que comece esta recensão agradecendo ao babelio e ao Kennes, porque estou muito contente por ter descoberto a sua nova série K, mesmo que ainda tenha ficado com uma má impressão desta última leitura.",books,pt,,False
14,"有了这本小说，我就完成了作为 ""临界质量 ""特别活动的一部分寄给我的五本书，因此，在这篇评论的开头，我自然要感谢 babelio 和 Kennes，因为我真的很高兴发现了他们的新 K 系列，尽管最近这本书给我留下了不好的印象。",books,zh,,False


In [3]:
df_synth.shape

(2985, 5)

## D3lta use

Here we apply `semantic_faiss` to find matching texts. `semantic_faiss` will:
- preprocess the text and detect the language if it is not given,
- compute embeddings,
- create a faiss index,
- find pairs of texts that are closer than the minimum threshold given,
- distinguish the types of duplicated content (copypasta, rewording, translation) according to the thresholds,
- remove duplicated content coming from the same user (optional).
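
The steps above can be sketched on a toy example. This is not D3lta's implementation: the embeddings are made-up 2-d vectors instead of sentence embeddings, a brute-force NumPy comparison replaces the faiss index, `difflib.SequenceMatcher` stands in for the grapheme score, and the decision order in the hypothetical `classify_pair` helper is only an assumption about how the three thresholds interact.

```python
from difflib import SequenceMatcher
from itertools import combinations

import numpy as np

# Toy documents: id -> (text, language, made-up embedding).
# Real embeddings would come from a sentence encoder.
docs = {
    "a": ("the cat sleeps on the sofa", "en", np.array([1.0, 0.0])),
    "b": ("the cat sleeps on the sofa!", "en", np.array([0.99, 0.05])),
    "c": ("a feline naps on the couch", "en", np.array([0.9, 0.3])),
    "d": ("le chat dort sur le canapé", "fr", np.array([0.95, 0.2])),
    "e": ("stock markets fell sharply", "en", np.array([0.0, 1.0])),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def classify_pair(text_a, lang_a, text_b, lang_b, score,
                  threshold_grapheme=0.7,
                  threshold_language=0.748,
                  threshold_semantic=0.7478):
    """Hypothetical decision rule, for illustration only."""
    if lang_a != lang_b:
        # Different languages: only a translation is possible.
        return "translation" if score >= threshold_language else None
    if score < threshold_semantic:
        return None
    # Same language and semantically close: use character-level similarity
    # to separate near-verbatim copies from rewordings.
    grapheme = SequenceMatcher(None, text_a, text_b).ratio()
    return "copy-pasta" if grapheme >= threshold_grapheme else "rewording"


# Compare every pair of documents (a faiss index avoids this quadratic loop).
matches = {}
for (id_a, doc_a), (id_b, doc_b) in combinations(docs.items(), 2):
    text_a, lang_a, emb_a = doc_a
    text_b, lang_b, emb_b = doc_b
    label = classify_pair(text_a, lang_a, text_b, lang_b, cosine(emb_a, emb_b))
    if label is not None:
        matches[f"{id_a}-{id_b}"] = label

print(matches)
```

Document `e` is about an unrelated topic, so it should not appear in any matched pair.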

The function returns:
- `matches_synth`: pairs of texts that are duplicated content of each other,
- `df_clusters`: the initial dataset, with the cluster of duplicated content assigned to each document.

In [4]:
matches_synth, df_clusters = semantic_faiss(
    df=df_synth,
    min_size_txt=1,
    df_embeddings_use=None,
    embeddings_to_save='faiss_synth_test',
    threshold_grapheme=0.7,
    threshold_language=0.748,
    threshold_semantic=0.7478,
    remove_matches_same_user=None,
)

>>> Start prepare_dataset
Removing 0 short texts over 2985 sentences...
<<< End prepare_dataset, Took: 4.5146 sec
>>> Start compute_embeddings
INFO:tensorflow:Assets written to: use_model_kaggle/assets
<<< End compute_embeddings, Took: 42.1617 sec
>>> Start create_index_cosine
C contiguous problem solved
<<< End create_index_cosine, Took: 0.0059 sec
>>> Start find_matches
<<< End find_matches, Took: 0.3774 sec
>>> Start compute_duplicate_types
<<< End compute_duplicate_types, Took: 0.0875 sec


In [5]:
matches_synth['dup_type'].value_counts()

dup_type
copy-pasta     4252
rewording      3556
translation    2708
Name: count, dtype: int64

Here, the algorithm found 4252 copy-pasta pairs.

In [6]:
df_clusters.cluster.value_counts(dropna=False)

cluster
NaN      168
0.0       10
67.0      10
54.0      10
55.0      10
        ... 
299.0      2
305.0      2
306.0      2
312.0      2
303.0      2
Name: count, Length: 314, dtype: int64

The largest clusters contain 10 documents and the smallest contain 2. 168 documents have not been detected as duplicated content (their cluster is `NaN`).
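
One way to picture how matched pairs turn into clusters is to treat each pair as a link and to group documents that are linked directly or indirectly. The sketch below does this with a small union-find on made-up ids. It is an assumption for illustration only: D3lta's own clustering step may use a different method.

```python
from collections import defaultdict

# Made-up matched pairs (source id, target id); ids 30 and 31 match nothing.
pairs = [("10", "11"), ("11", "12"), ("20", "21")]
all_docs = ["10", "11", "12", "20", "21", "30", "31"]

# Union-find: every document starts as its own group representative.
parent = {doc: doc for doc in all_docs}


def find(doc):
    """Return the representative of the group containing `doc`."""
    while parent[doc] != doc:
        parent[doc] = parent[parent[doc]]  # path compression
        doc = parent[doc]
    return doc


# Merge the groups of the two documents of each matched pair.
for source, target in pairs:
    parent[find(source)] = find(target)

groups = defaultdict(list)
for doc in all_docs:
    groups[find(doc)].append(doc)

# Documents alone in their group get no cluster id,
# which mirrors the NaN values in `df_clusters.cluster`.
clusters = {}
next_id = 0
for members in groups.values():
    if len(members) > 1:
        for doc in members:
            clusters[doc] = next_id
        next_id += 1
    else:
        clusters[members[0]] = None

print(clusters)
```

Documents 10 and 12 never appear together in a pair, yet they share a cluster because both are linked to document 11.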

## Verification with true label

`df_annotated` is a dataset of annotated pairs of text that can be `translation`, `copypasta`, `rewording`, and `nomatch`.

In [7]:
df_annotated = pd.read_csv('../data/synthetic_dataset_pairs_unbalanced.csv', dtype=object)

df_annotated.head(3)

Unnamed: 0,source_target,source,target,original_source,original_target,language_source,language_target,true_label
0,10-11,10,11,"Voici que j'achève, avec ce roman, les cinq ouvrages qui m'ont été envoyés dans le cadre d'une opération Masse Critique privilégiée et c'est donc tout naturellement que je commence cette critique en remerciant babelio ainsi que les éditions Kennes car je suis vraiment contente d'avoir découvert leur nouvelle collections K, même si si je reste sur une mauvaise impression avec cette dernière lecture.","With this novel, I've completed the five books sent to me as part of a special Critical Mass campaign, so it's only natural that I should start this review by thanking babelio and Kennes, because I'm really pleased to have discovered their new K series, even if I'm still left with a bad impression from this latest read.",fr,en,translation
1,10-12,10,12,"Voici que j'achève, avec ce roman, les cinq ouvrages qui m'ont été envoyés dans le cadre d'une opération Masse Critique privilégiée et c'est donc tout naturellement que je commence cette critique en remerciant babelio ainsi que les éditions Kennes car je suis vraiment contente d'avoir découvert leur nouvelle collections K, même si si je reste sur une mauvaise impression avec cette dernière lecture.","Mit diesem Roman schließe ich die fünf Bücher ab, die mir im Rahmen einer Aktion ""Masse Critique"" zugeschickt wurden, und so ist es nur natürlich, dass ich diese Rezension mit einem Dank an babelio und den Kennes-Verlag beginne, denn ich bin wirklich froh, ihre neue K-Kollektion entdeckt zu haben, auch wenn ich bei der letzten Lektüre einen schlechten Eindruck hatte.",fr,de,translation
2,10-13,10,13,"Voici que j'achève, avec ce roman, les cinq ouvrages qui m'ont été envoyés dans le cadre d'une opération Masse Critique privilégiée et c'est donc tout naturellement que je commence cette critique en remerciant babelio ainsi que les éditions Kennes car je suis vraiment contente d'avoir découvert leur nouvelle collections K, même si si je reste sur une mauvaise impression avec cette dernière lecture.","Com este romance, completei os cinco livros que me foram enviados no âmbito de uma campanha especial da Massa Crítica, por isso é natural que comece esta recensão agradecendo ao babelio e ao Kennes, porque estou muito contente por ter descoberto a sua nova série K, mesmo que ainda tenha ficado com uma má impressão desta última leitura.",fr,pt,translation


In [8]:
df_annotated.true_label.value_counts()

true_label
nomatch        1485000
translation       4500
copypasta         4030
rewording         4017
Name: count, dtype: int64

In [9]:
df_eval = (
    df_annotated.merge(
        matches_synth[['duplicates','source','target','dup_type','score','score_lev']], 
        left_on='source_target', 
        right_on='duplicates',
        how='left')
)

df_eval.loc[df_eval.dup_type.isnull(), "dup_type"] = 'nomatch'
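
A toy version of this join shows how the left merge keeps every annotated pair and how pairs absent from the matches end up labelled `nomatch`. The rows are made up, and the sketch assumes the pair key is written in the same order (`source-target`) in both tables.

```python
import pandas as pd

# Annotated pairs: the ground truth, one row per pair.
annotated = pd.DataFrame({
    "source_target": ["10-11", "10-12", "10-99"],
    "true_label": ["translation", "translation", "nomatch"],
})

# Pairs found by the algorithm: only one of the two true translations.
found = pd.DataFrame({
    "duplicates": ["10-11"],
    "dup_type": ["translation"],
    "score": [0.91],
})

# A left merge keeps all annotated pairs; unmatched rows get NaN in `dup_type`.
toy_eval = annotated.merge(
    found, left_on="source_target", right_on="duplicates", how="left"
)
toy_eval["dup_type"] = toy_eval["dup_type"].fillna("nomatch")

print(toy_eval[["source_target", "true_label", "dup_type"]])
```

The pair `10-12` is annotated as a translation but predicted as `nomatch`, which is the kind of miss counted in the confusion matrix.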

In [10]:
(
    pd.crosstab(df_eval.true_label, df_eval.dup_type, dropna=False)
    .reindex(["copypasta", "rewording", "translation", "nomatch"])
    [["copy-pasta", "rewording", "translation", "nomatch"]]
)

true_label \ dup_type,copy-pasta,rewording,translation,nomatch
copypasta,3871,94,0,65
rewording,217,3147,0,653
translation,0,0,2708,1792
nomatch,0,0,0,1485000


The D3lta algorithm:
- mistook 217 rewordings for copypastas and 94 copypastas for rewordings,
- predicted no `nomatch` pair as duplicated content,
- missed 2510 duplicated pairs (65 copypastas, 653 rewordings, 1792 translations).

The algorithm nevertheless has very good precision on duplicated content: no `nomatch` pair is wrongly detected as duplicated content by D3lta.
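
These observations can be turned into numbers. The sketch below re-enters the confusion matrix by hand and computes, for duplicate detection as a whole, precision (share of predicted duplicates that are annotated duplicates) and recall (share of annotated duplicates that were predicted), plus a recall per duplicate type. The variable names are illustrative.

```python
import pandas as pd

labels_true = ["copypasta", "rewording", "translation", "nomatch"]
labels_pred = ["copy-pasta", "rewording", "translation", "nomatch"]

# Rows: true label, columns: predicted type (counts copied from the crosstab).
confusion = pd.DataFrame(
    [
        [3871, 94, 0, 65],
        [217, 3147, 0, 653],
        [0, 0, 2708, 1792],
        [0, 0, 0, 1485000],
    ],
    index=labels_true,
    columns=labels_pred,
)

dup_true = labels_true[:3]  # annotated duplicate types
dup_pred = labels_pred[:3]  # predicted duplicate types

# Pairs predicted as duplicates, whatever their true label.
predicted_dup = confusion[dup_pred].to_numpy().sum()
# Pairs annotated as duplicates, whatever the prediction.
true_dup = confusion.loc[dup_true].to_numpy().sum()
# Annotated duplicates predicted as duplicates (of any type).
hits = confusion.loc[dup_true, dup_pred].to_numpy().sum()
# Annotated duplicates predicted as nomatch.
missed = confusion.loc[dup_true, "nomatch"].sum()

precision = hits / predicted_dup
recall = hits / true_dup

# Recall of each type: correctly typed pairs over all pairs of that type.
recall_by_type = pd.Series({
    true: confusion.loc[true, pred] / confusion.loc[true].sum()
    for true, pred in zip(dup_true, dup_pred)
})

print(f"missed={missed} precision={precision:.3f} recall={recall:.3f}")
print(recall_by_type.round(3))
```

With these counts, precision stays at 1 because no `nomatch` pair is flagged, while the missed pairs lower recall; translation is the type with the lowest recall.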

## Graph

It is possible to visualize the result as a graph.

/!\ Depending on your Jupyter version, the graph visualization might not work. See this [issue](https://github.com/medialab/ipysigma/issues/242).

In [12]:
from pelote import tables_to_graph
from ipysigma import Sigma

In [13]:
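def create_edges_nodes(matches, nodes_columns):
    """Build the edge and node tables of the graph from the matches dataframe."""
    # edges: one row per matched pair
    edges_plot = matches.copy()

    # nodes: stack the source and target sides of each pair,
    # strip the `_source` / `_target` suffixes and keep one row per document id
    source_cols = ['source'] + [col + '_source' for col in nodes_columns]
    target_cols = ['target'] + [col + '_target' for col in nodes_columns]
    nodes_plot = (
        pd.concat([
            matches[source_cols].rename(columns={'source': 'id'}).rename(columns=lambda x: x.replace('_source', '')),
            matches[target_cols].rename(columns={'target': 'id'}).rename(columns=lambda x: x.replace('_target', '')),
        ], ignore_index=True)
        .drop_duplicates('id')
    )
    # constant column, kept so that it can be passed as node data to the graph
    nodes_plot["blank"] = " "
    return edges_plot, nodes_plot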

In [14]:
edges, nodes = create_edges_nodes(matches_synth, ['text_to_embed','language'])

In [15]:
g = tables_to_graph(
    nodes.reset_index()[["id",'blank','text_to_embed','language']].astype(str),
    edges[["source","target","dup_type","score"]].astype(str), 
    node_col="id",
    node_data=["id",'blank','text_to_embed','language'],
    edge_data=['score',"dup_type"],
)

graph_sigma = Sigma(
    g,
    node_label="text_to_embed",
    default_node_size=0.5,
    edge_color="dup_type",
    default_edge_type="curve",
    node_border_color_from="node",
    label_density=5,
    label_rendered_size_threshold=0.0000001,
)

graph_sigma

Sigma(nx.Graph with 2,817 nodes and 10,516 edges)