{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "dcaa3da1", "metadata": {}, "source": [ "# Notebook for text extraction on image\n", "\n", "The text extraction and analysis is carried out using a variety of tools: \n", "\n", "1. Text extraction from the image using [google-cloud-vision](https://cloud.google.com/vision) \n", "1. Language detection of the extracted text using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n", "1. Translation into English or other languages using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/) \n", "1. Cleaning of the text using [spacy](https://spacy.io/) \n", "1. Spell-check using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n", "1. Subjectivity analysis using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) \n", "1. Text summarization using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n", "1. Sentiment analysis using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n", "1. Named entity recognition using [transformers](https://huggingface.co/docs/transformers/index) pipelines \n", "1. Topic analysis using [BERTopic](https://github.com/MaartenGr/BERTopic) \n", "\n", "The first cell is only run on google colab and installs the [ammico](https://github.com/ssciwr/AMMICO) package.\n", "\n", "After that, we can import `ammico` and read in the files given a folder path." ] }, { "cell_type": "code", "execution_count": 1, "id": "f43f327c", "metadata": { "execution": { "iopub.execute_input": "2023-06-23T12:23:00.786489Z", "iopub.status.busy": "2023-06-23T12:23:00.785951Z", "iopub.status.idle": "2023-06-23T12:23:00.794750Z", "shell.execute_reply": "2023-06-23T12:23:00.794192Z" } }, "outputs": [], "source": [ "# if running on google colab\n", "# flake8-noqa-cell\n", "import os\n", "\n", "if \"google.colab\" in str(get_ipython()):\n", " # update python version\n", " # install setuptools\n", " # %pip install setuptools==61 -qqq\n", " # install ammico\n", " %pip install git+https://github.com/ssciwr/ammico.git -qqq\n", " # mount google drive for data and API key\n", " from google.colab import drive\n", "\n", " drive.mount(\"/content/drive\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "cf362e60", "metadata": { "execution": { "iopub.execute_input": "2023-06-23T12:23:00.797570Z", "iopub.status.busy": "2023-06-23T12:23:00.797005Z", "iopub.status.idle": "2023-06-23T12:23:14.434068Z", "shell.execute_reply": "2023-06-23T12:23:14.433419Z" } }, "outputs": [], "source": [ "import os\n", "import ammico\n", "from ammico import utils as mutils\n", "from ammico import display as mdisplay" ] }, { "attachments": {}, "cell_type": "markdown", "id": "fddba721", "metadata": {}, "source": [ "We select a subset of image files to try the text extraction on, see the `limit` keyword. 
The `find_files` function finds image files within a given directory: " ] }, { "cell_type": "code", "execution_count": 3, "id": "27675810", "metadata": { "execution": { "iopub.execute_input": "2023-06-23T12:23:14.437891Z", "iopub.status.busy": "2023-06-23T12:23:14.437175Z", "iopub.status.idle": "2023-06-23T12:23:14.441193Z", "shell.execute_reply": "2023-06-23T12:23:14.440570Z" } }, "outputs": [], "source": [ "# Here you need to provide the path to your Google Drive folder\n", "# or local folder containing the images\n", "images = mutils.find_files(\n", "    path=\"data/\",\n", "    limit=10,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3a7dfe11", "metadata": {}, "source": [ "We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:" ] }, { "cell_type": "code", "execution_count": 4, "id": "8b32409f", "metadata": { "execution": { "iopub.execute_input": "2023-06-23T12:23:14.443971Z", "iopub.status.busy": "2023-06-23T12:23:14.443642Z", "iopub.status.idle": "2023-06-23T12:23:14.446733Z", "shell.execute_reply": "2023-06-23T12:23:14.446119Z" } }, "outputs": [], "source": [ "mydict = mutils.initialize_dict(images)" ] }, { "cell_type": "markdown", "id": "7b8b929f", "metadata": {}, "source": [ "## Google Cloud Vision API\n", "\n", "For this you need an API key, and the Cloud Vision API must be activated in your Google Cloud console. The first 1000 images per month are free (as of July 2022). Point the credentials environment variable to your credentials file:" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986", "metadata": {}, "source": [ "```\n", "os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"] = \"your-credentials.json\"\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0891b795-c7fe-454c-a45d-45fadf788142", "metadata": {}, "source": [ "## Inspect the elements per image\n", "To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below.\n", "Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the `port` number if you are already running several notebook instances on the same server."
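, "\n", "\n", "For example, if the default port used below is already taken, you can pass any other free port to `run_server` (the value 8055 here is arbitrary):\n", "\n", "```\n", "analysis_explorer = mdisplay.AnalysisExplorer(mydict)\n", "analysis_explorer.run_server(port=8055)\n", "```"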
] }, { "cell_type": "code", "execution_count": 5, "id": "7c6ecc88", "metadata": { "execution": { "iopub.execute_input": "2023-06-23T12:23:14.449813Z", "iopub.status.busy": "2023-06-23T12:23:14.449608Z", "iopub.status.idle": "2023-06-23T12:23:15.481829Z", "shell.execute_reply": "2023-06-23T12:23:15.481213Z" } }, "outputs": [ { "ename": "TypeError", "evalue": "__init__() got an unexpected keyword argument 'identify'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[5], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m analysis_explorer \u001b[38;5;241m=\u001b[39m \u001b[43mmdisplay\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mAnalysisExplorer\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmydict\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43midentify\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mtext-on-image\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2\u001b[0m analysis_explorer\u001b[38;5;241m.\u001b[39mrun_server(port\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m8054\u001b[39m)\n", "\u001b[0;31mTypeError\u001b[0m: __init__() got an unexpected keyword argument 'identify'" ] } ], "source": [ "analysis_explorer = mdisplay.AnalysisExplorer(mydict, identify=\"text-on-image\")\n", "analysis_explorer.run_server(port=8054)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52", "metadata": {}, "source": [ "## Or directly analyze for further processing\n", "Instead of inspecting each of the images, you can also directly carry out the analysis and export the result into a csv. This may take a while depending on how many images you have loaded. Set the keyword `analyse_text` to `True` if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER)." 
] }, { "cell_type": "code", "execution_count": 6, "id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f", "metadata": { "execution": { "iopub.execute_input": "2023-06-23T12:23:15.485576Z", "iopub.status.busy": "2023-06-23T12:23:15.485122Z", "iopub.status.idle": "2023-06-23T12:24:30.756386Z", "shell.execute_reply": "2023-06-23T12:24:30.748510Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)/a4f8f3e/config.json: 0.00B [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)/a4f8f3e/config.json: 1.80kB [00:00, 6.30MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 0%| | 0.00/1.22G [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 1%| | 10.5M/1.22G [00:00<00:13, 86.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 2%|▏ | 21.0M/1.22G [00:00<00:12, 94.2MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 3%|▎ | 31.5M/1.22G [00:00<00:12, 97.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 4%|▍ | 52.4M/1.22G [00:00<00:11, 101MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 5%|▌ | 62.9M/1.22G [00:00<00:11, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 6%|▌ | 73.4M/1.22G [00:00<00:11, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 7%|▋ | 83.9M/1.22G [00:00<00:11, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 8%|▊ | 94.4M/1.22G [00:00<00:11, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 9%|▊ | 105M/1.22G [00:01<00:11, 100MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 9%|▉ | 115M/1.22G [00:01<00:11, 99.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 10%|█ | 126M/1.22G [00:01<00:10, 101MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 11%|█ | 136M/1.22G [00:01<00:10, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 12%|█▏ | 147M/1.22G [00:01<00:10, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 13%|█▎ | 157M/1.22G [00:01<00:10, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 14%|█▎ | 168M/1.22G [00:01<00:10, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 15%|█▍ | 178M/1.22G [00:01<00:10, 99.3MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 15%|█▌ | 189M/1.22G [00:01<00:10, 100MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 16%|█▋ | 199M/1.22G [00:01<00:10, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 17%|█▋ | 210M/1.22G [00:02<00:10, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"\r", "Downloading pytorch_model.bin: 18%|█▊ | 220M/1.22G [00:02<00:09, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 19%|█▉ | 231M/1.22G [00:02<00:09, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 20%|█▉ | 241M/1.22G [00:02<00:09, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 21%|██ | 252M/1.22G [00:02<00:09, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 21%|██▏ | 262M/1.22G [00:02<00:09, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 23%|██▎ | 283M/1.22G [00:02<00:09, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 24%|██▍ | 294M/1.22G [00:02<00:09, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 25%|██▍ | 304M/1.22G [00:03<00:09, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 26%|██▌ | 315M/1.22G [00:03<00:09, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 27%|██▋ | 325M/1.22G [00:03<00:08, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 27%|██▋ | 336M/1.22G [00:03<00:08, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 28%|██▊ | 346M/1.22G [00:03<00:08, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 29%|██▉ | 357M/1.22G [00:03<00:08, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 30%|███ | 367M/1.22G [00:03<00:08, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 31%|███ | 377M/1.22G [00:03<00:08, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 32%|███▏ | 388M/1.22G [00:03<00:08, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 33%|███▎ | 398M/1.22G [00:03<00:08, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 33%|███▎ | 409M/1.22G [00:04<00:07, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 34%|███▍ | 419M/1.22G [00:04<00:08, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 35%|███▌ | 430M/1.22G [00:04<00:07, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 36%|███▌ | 440M/1.22G [00:04<00:07, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 37%|███▋ | 451M/1.22G [00:04<00:07, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 38%|███▊ | 461M/1.22G [00:04<00:07, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 39%|███▊ | 472M/1.22G [00:04<00:07, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 39%|███▉ | 482M/1.22G [00:04<00:07, 102MB/s]" ] }, { "name": "stderr", 
"output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 40%|████ | 493M/1.22G [00:04<00:07, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 41%|████ | 503M/1.22G [00:04<00:07, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 42%|████▏ | 514M/1.22G [00:05<00:06, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 43%|████▎ | 524M/1.22G [00:05<00:06, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 44%|████▍ | 535M/1.22G [00:05<00:06, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 45%|████▍ | 545M/1.22G [00:05<00:06, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 45%|████▌ | 556M/1.22G [00:05<00:06, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 46%|████▋ | 566M/1.22G [00:05<00:06, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 47%|████▋ | 577M/1.22G [00:05<00:06, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 48%|████▊ | 587M/1.22G [00:05<00:06, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 49%|████▉ | 598M/1.22G [00:05<00:06, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 50%|████▉ | 608M/1.22G [00:06<00:06, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 51%|█████ | 619M/1.22G [00:06<00:05, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 51%|█████▏ | 629M/1.22G [00:06<00:05, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 52%|█████▏ | 640M/1.22G [00:06<00:05, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 53%|█████▎ | 650M/1.22G [00:06<00:05, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 54%|█████▍ | 661M/1.22G [00:06<00:05, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 55%|█████▍ | 671M/1.22G [00:06<00:05, 99.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 56%|█████▌ | 682M/1.22G [00:06<00:05, 97.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 57%|█████▋ | 692M/1.22G [00:06<00:05, 98.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 57%|█████▋ | 703M/1.22G [00:06<00:05, 100MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 58%|█████▊ | 713M/1.22G [00:07<00:05, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 59%|█████▉ | 724M/1.22G [00:07<00:04, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 60%|██████ | 734M/1.22G [00:07<00:04, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading 
pytorch_model.bin: 61%|██████ | 744M/1.22G [00:07<00:04, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 62%|██████▏ | 755M/1.22G [00:07<00:04, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 63%|██████▎ | 765M/1.22G [00:07<00:05, 84.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 64%|██████▍ | 786M/1.22G [00:07<00:04, 104MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 66%|██████▌ | 807M/1.22G [00:08<00:04, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 67%|██████▋ | 818M/1.22G [00:08<00:05, 68.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 68%|██████▊ | 828M/1.22G [00:08<00:05, 74.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 69%|██████▉ | 849M/1.22G [00:08<00:03, 98.0MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 71%|███████ | 870M/1.22G [00:08<00:03, 98.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 73%|███████▎ | 891M/1.22G [00:09<00:04, 82.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 75%|███████▍ | 912M/1.22G [00:09<00:03, 101MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 76%|███████▋ | 933M/1.22G [00:09<00:02, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 78%|███████▊ | 954M/1.22G [00:10<00:04, 53.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 80%|███████▉ | 975M/1.22G [00:10<00:03, 66.0MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 81%|████████ | 986M/1.22G [00:10<00:03, 68.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 82%|████████▏ | 1.01G/1.22G [00:10<00:02, 84.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 84%|████████▍ | 1.03G/1.22G [00:10<00:02, 89.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 86%|████████▌ | 1.05G/1.22G [00:11<00:01, 93.2MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 88%|████████▊ | 1.07G/1.22G [00:11<00:01, 96.0MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 89%|████████▉ | 1.09G/1.22G [00:11<00:01, 96.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 90%|█████████ | 1.10G/1.22G [00:11<00:01, 97.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 91%|█████████ | 1.11G/1.22G [00:11<00:01, 98.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 92%|█████████▏| 1.12G/1.22G [00:11<00:01, 99.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 93%|█████████▎| 1.13G/1.22G [00:11<00:00, 100MB/s] " ] }, { "name": "stderr", "output_type": "stream", 
"text": [ "\r", "Downloading pytorch_model.bin: 94%|█████████▎| 1.14G/1.22G [00:11<00:00, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 94%|█████████▍| 1.15G/1.22G [00:12<00:00, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 95%|█████████▌| 1.16G/1.22G [00:12<00:00, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 96%|█████████▌| 1.17G/1.22G [00:12<00:00, 96.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 97%|█████████▋| 1.18G/1.22G [00:12<00:00, 98.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 98%|█████████▊| 1.20G/1.22G [00:12<00:00, 99.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 99%|█████████▊| 1.21G/1.22G [00:12<00:00, 100MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 100%|█████████▉| 1.22G/1.22G [00:12<00:00, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 100%|██████████| 1.22G/1.22G [00:12<00:00, 95.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)okenizer_config.json: 0%| | 0.00/26.0 [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)okenizer_config.json: 100%|██████████| 26.0/26.0 [00:00<00:00, 25.8kB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)e/a4f8f3e/vocab.json: 0.00B [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)e/a4f8f3e/vocab.json: 899kB [00:00, 11.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)e/a4f8f3e/merges.txt: 0.00B [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)e/a4f8f3e/merges.txt: 456kB [00:00, 8.34MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)/af0f99b/config.json: 0%| | 0.00/629 [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)/af0f99b/config.json: 100%|██████████| 629/629 [00:00<00:00, 615kB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 0%| | 0.00/268M [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 4%|▍ | 10.5M/268M [00:00<00:04, 53.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 8%|▊ | 21.0M/268M [00:00<00:03, 74.0MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 12%|█▏ | 31.5M/268M [00:00<00:02, 81.0MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 16%|█▌ | 41.9M/268M [00:00<00:02, 86.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 20%|█▉ | 52.4M/268M [00:00<00:02, 
90.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 23%|██▎ | 62.9M/268M [00:00<00:02, 94.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 27%|██▋ | 73.4M/268M [00:00<00:02, 96.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 31%|███▏ | 83.9M/268M [00:00<00:01, 95.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 35%|███▌ | 94.4M/268M [00:01<00:03, 56.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 43%|████▎ | 115M/268M [00:01<00:02, 71.0MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 47%|████▋ | 126M/268M [00:01<00:01, 76.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 51%|█████ | 136M/268M [00:01<00:01, 82.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 55%|█████▍ | 147M/268M [00:01<00:01, 87.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 59%|█████▊ | 157M/268M [00:01<00:01, 91.2MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 63%|██████▎ | 168M/268M [00:02<00:01, 94.3MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 67%|██████▋ | 178M/268M [00:02<00:01, 68.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 70%|███████ | 189M/268M [00:02<00:01, 52.3MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 82%|████████▏ | 220M/268M [00:02<00:00, 86.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 90%|█████████ | 241M/268M [00:02<00:00, 91.3MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 94%|█████████▍| 252M/268M [00:03<00:00, 92.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 98%|█████████▊| 262M/268M [00:03<00:00, 94.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 100%|██████████| 268M/268M [00:03<00:00, 82.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)okenizer_config.json: 0%| | 0.00/48.0 [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)okenizer_config.json: 100%|██████████| 48.0/48.0 [00:00<00:00, 46.2kB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)ve/af0f99b/vocab.txt: 0.00B [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)ve/af0f99b/vocab.txt: 232kB [00:00, 6.12MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)/f2482bf/config.json: 0%| | 0.00/998 [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)/f2482bf/config.json: 100%|██████████| 998/998 [00:00<00:00, 960kB/s]" ] }, { "name": 
"stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 0%| | 0.00/1.33G [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 1%| | 10.5M/1.33G [00:00<00:12, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 2%|▏ | 21.0M/1.33G [00:00<00:12, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 2%|▏ | 31.5M/1.33G [00:00<00:12, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 3%|▎ | 41.9M/1.33G [00:00<00:12, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 4%|▍ | 52.4M/1.33G [00:00<00:12, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 5%|▍ | 62.9M/1.33G [00:00<00:12, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 6%|▌ | 73.4M/1.33G [00:00<00:12, 98.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 6%|▋ | 83.9M/1.33G [00:00<00:12, 97.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 7%|▋ | 94.4M/1.33G [00:00<00:12, 97.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 8%|▊ | 105M/1.33G [00:01<00:12, 99.2MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 9%|▊ | 115M/1.33G [00:01<00:12, 99.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 9%|▉ | 126M/1.33G [00:01<00:12, 100MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 10%|█ | 136M/1.33G [00:01<00:12, 99.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 11%|█ | 147M/1.33G [00:01<00:11, 101MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 12%|█▏ | 157M/1.33G [00:01<00:11, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 13%|█▎ | 168M/1.33G [00:01<00:11, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 13%|█▎ | 178M/1.33G [00:01<00:11, 99.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 14%|█▍ | 189M/1.33G [00:01<00:11, 99.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 15%|█▍ | 199M/1.33G [00:01<00:11, 101MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 16%|█▌ | 210M/1.33G [00:02<00:11, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 17%|█▋ | 220M/1.33G [00:02<00:11, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 17%|█▋ | 231M/1.33G [00:02<00:10, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 18%|█▊ | 241M/1.33G [00:02<00:11, 95.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 19%|█▉ | 
252M/1.33G [00:02<00:11, 97.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 20%|█▉ | 262M/1.33G [00:03<00:24, 43.3MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 20%|██ | 273M/1.33G [00:03<00:20, 52.3MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 22%|██▏ | 294M/1.33G [00:03<00:13, 77.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 24%|██▎ | 315M/1.33G [00:03<00:11, 89.2MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 25%|██▌ | 336M/1.33G [00:03<00:11, 89.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 27%|██▋ | 357M/1.33G [00:03<00:10, 93.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 28%|██▊ | 377M/1.33G [00:04<00:09, 96.0MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 29%|██▉ | 388M/1.33G [00:04<00:09, 97.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 30%|██▉ | 398M/1.33G [00:04<00:09, 97.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 31%|███ | 409M/1.33G [00:04<00:09, 98.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 31%|███▏ | 419M/1.33G [00:04<00:09, 99.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 32%|███▏ | 430M/1.33G [00:04<00:08, 101MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 33%|███▎ | 440M/1.33G [00:04<00:08, 99.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 35%|███▍ | 461M/1.33G [00:04<00:08, 102MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 35%|███▌ | 472M/1.33G [00:05<00:08, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 36%|███▌ | 482M/1.33G [00:05<00:08, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 37%|███▋ | 493M/1.33G [00:05<00:08, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 38%|███▊ | 503M/1.33G [00:05<00:08, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 39%|███▊ | 514M/1.33G [00:05<00:07, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 39%|███▉ | 524M/1.33G [00:05<00:07, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 41%|████ | 545M/1.33G [00:05<00:07, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 42%|████▏ | 556M/1.33G [00:05<00:07, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 42%|████▏ | 566M/1.33G [00:05<00:07, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 43%|████▎ | 577M/1.33G [00:06<00:07, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": 
[ "\r", "Downloading pytorch_model.bin: 44%|████▍ | 587M/1.33G [00:06<00:07, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 45%|████▍ | 598M/1.33G [00:06<00:07, 99.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 46%|████▌ | 608M/1.33G [00:06<00:07, 99.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 46%|████▋ | 619M/1.33G [00:06<00:16, 42.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 47%|████▋ | 629M/1.33G [00:07<00:13, 50.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 48%|████▊ | 640M/1.33G [00:07<00:14, 47.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 50%|████▉ | 661M/1.33G [00:07<00:09, 69.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 50%|█████ | 671M/1.33G [00:07<00:08, 76.2MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 51%|█████ | 682M/1.33G [00:07<00:07, 81.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 52%|█████▏ | 692M/1.33G [00:07<00:07, 86.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 53%|█████▎ | 703M/1.33G [00:07<00:07, 90.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 53%|█████▎ | 713M/1.33G [00:07<00:06, 93.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 54%|█████▍ | 724M/1.33G [00:08<00:06, 95.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 55%|█████▌ | 734M/1.33G [00:08<00:06, 98.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 56%|█████▌ | 744M/1.33G [00:08<00:05, 99.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 57%|█████▋ | 755M/1.33G [00:08<00:05, 100MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 57%|█████▋ | 765M/1.33G [00:08<00:05, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 58%|█████▊ | 776M/1.33G [00:08<00:05, 97.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 59%|█████▉ | 786M/1.33G [00:08<00:05, 99.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 60%|█████▉ | 797M/1.33G [00:08<00:05, 100MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 61%|██████ | 807M/1.33G [00:08<00:05, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 61%|██████▏ | 818M/1.33G [00:09<00:05, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 62%|██████▏ | 828M/1.33G [00:09<00:05, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 63%|██████▎ | 839M/1.33G [00:09<00:04, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading 
pytorch_model.bin: 64%|██████▎ | 849M/1.33G [00:09<00:04, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 65%|██████▌ | 870M/1.33G [00:09<00:04, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 66%|██████▌ | 881M/1.33G [00:09<00:04, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 67%|██████▋ | 891M/1.33G [00:09<00:04, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 68%|██████▊ | 912M/1.33G [00:09<00:04, 103MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 69%|██████▉ | 923M/1.33G [00:10<00:04, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 70%|██████▉ | 933M/1.33G [00:10<00:03, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 71%|███████ | 944M/1.33G [00:10<00:03, 102MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 72%|███████▏ | 954M/1.33G [00:10<00:03, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 72%|███████▏ | 965M/1.33G [00:10<00:03, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 73%|███████▎ | 975M/1.33G [00:10<00:03, 97.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 74%|███████▍ | 986M/1.33G [00:10<00:03, 99.1MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 75%|███████▍ | 996M/1.33G [00:10<00:03, 101MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 75%|███████▌ | 1.01G/1.33G [00:10<00:03, 101MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 76%|███████▌ | 1.02G/1.33G [00:11<00:03, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 77%|███████▋ | 1.03G/1.33G [00:11<00:03, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 78%|███████▊ | 1.04G/1.33G [00:11<00:02, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 79%|███████▊ | 1.05G/1.33G [00:11<00:02, 100MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 79%|███████▉ | 1.06G/1.33G [00:11<00:02, 99.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 80%|████████ | 1.07G/1.33G [00:11<00:02, 98.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 81%|████████ | 1.08G/1.33G [00:11<00:02, 87.2MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 83%|████████▎ | 1.10G/1.33G [00:11<00:02, 85.2MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 84%|████████▍ | 1.12G/1.33G [00:12<00:01, 110MB/s] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 86%|████████▌ | 1.14G/1.33G [00:12<00:01, 107MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", 
"Downloading pytorch_model.bin: 87%|████████▋ | 1.16G/1.33G [00:12<00:02, 62.3MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 89%|████████▉ | 1.18G/1.33G [00:13<00:02, 72.8MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 90%|█████████ | 1.21G/1.33G [00:13<00:02, 57.0MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 92%|█████████▏| 1.23G/1.33G [00:13<00:01, 73.6MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 94%|█████████▎| 1.25G/1.33G [00:13<00:01, 80.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 95%|█████████▌| 1.27G/1.33G [00:14<00:00, 86.4MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 97%|█████████▋| 1.29G/1.33G [00:14<00:00, 79.5MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 98%|█████████▊| 1.31G/1.33G [00:14<00:00, 72.3MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 100%|██████████| 1.33G/1.33G [00:14<00:00, 91.9MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading pytorch_model.bin: 100%|██████████| 1.33G/1.33G [00:14<00:00, 89.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)okenizer_config.json: 0%| | 0.00/60.0 [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)okenizer_config.json: 100%|██████████| 60.0/60.0 [00:00<00:00, 58.4kB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)ve/f2482bf/vocab.txt: 0.00B [00:00, ?B/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Downloading (…)ve/f2482bf/vocab.txt: 213kB [00:00, 12.7MB/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "for key in mydict:\n", " mydict[key] = ammico.text.TextDetector(\n", " mydict[key], analyse_text=True\n", " ).analyse_image()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3c063eda", "metadata": {}, "source": [ "## Convert to dataframe and write csv\n", "These steps are required to convert the dictionary of dictionarys into a dictionary with lists, that can be converted into a pandas dataframe and exported to a csv file." ] }, { "cell_type": "code", "execution_count": 7, "id": "5709c2cd", "metadata": { "execution": { "iopub.execute_input": "2023-06-23T12:24:30.784996Z", "iopub.status.busy": "2023-06-23T12:24:30.784319Z", "iopub.status.idle": "2023-06-23T12:24:30.818628Z", "shell.execute_reply": "2023-06-23T12:24:30.818032Z" } }, "outputs": [], "source": [ "outdict = mutils.append_data_to_dict(mydict)\n", "df = mutils.dump_df(outdict)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ae182eb7", "metadata": {}, "source": [ "Check the dataframe:" ] }, { "cell_type": "code", "execution_count": 8, "id": "c4f05637", "metadata": { "execution": { "iopub.execute_input": "2023-06-23T12:24:30.821898Z", "iopub.status.busy": "2023-06-23T12:24:30.821537Z", "iopub.status.idle": "2023-06-23T12:24:30.888205Z", "shell.execute_reply": "2023-06-23T12:24:30.887540Z" } }, "outputs": [ { "data": { "text/html": [ "
| \n", " | filename | \n", "text | \n", "text_language | \n", "text_english | \n", "text_summary | \n", "sentiment | \n", "sentiment_score | \n", "entity | \n", "entity_type | \n", "
|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "data/106349S_por.png | \n", "NEWS URGENTE SAMSUNG AO VIVO Rio de Janeiro NO... | \n", "pt | \n", "NEWS URGENT SAMSUNG LIVE Rio de Janeiro NEW CO... | \n", "NEW COUNTING METHOD RJ City HALL EXCLUDES 1,1... | \n", "NEGATIVE | \n", "0.99 | \n", "[Rio de Janeiro, C, ##IT, P, ##NA, ##LTO] | \n", "[LOC, ORG, LOC, LOC, ORG, LOC] | \n", "
| 1 | \n", "data/102141_2_eng.png | \n", "CORONAVIRUS QUARANTINE CORONAVIRUS OUTBREAK BE... | \n", "en | \n", "CORONAVIRUS QUARANTINE CORONAVIRUS OUTBREAK BE... | \n", "Coronavirus QUARANTINE CORONAVIRUS OUTBREAK | \n", "NEGATIVE | \n", "0.98 | \n", "[CORONAVIRUS, ##AR, ##TI, ##RONAVIR, ##C, Co] | \n", "[ORG, MISC, MISC, ORG, MISC, MISC] | \n", "
| 2 | \n", "data/102730_eng.png | \n", "400 DEATHS GET E-BOOK X AN Corporation ncy Ser... | \n", "en | \n", "400 DEATHS GET E-BOOK X AN Corporation ncy Ser... | \n", "A municipal worker sprays disinfectant on his... | \n", "NEGATIVE | \n", "0.99 | \n", "[AN Corporation ncy Services, Ahmedabad, RE, #... | \n", "[ORG, LOC, PER, ORG] | \n", "