Mirror of
https://github.com/ssciwr/AMMICO.git
synced 2025-10-31 22:16:05 +02:00
3337 lines
163 KiB
Plaintext
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "dcaa3da1",
   "metadata": {},
   "source": [
    "# Notebook for text extraction from images\n",
    "\n",
    "The text extraction and analysis are carried out using a variety of tools:\n",
    "\n",
    "1. Text extraction from the image using [google-cloud-vision](https://cloud.google.com/vision)\n",
    "1. Language detection of the extracted text using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/)\n",
    "1. Translation into English or other languages using [Googletrans](https://py-googletrans.readthedocs.io/en/latest/)\n",
    "1. Cleaning of the text using [spaCy](https://spacy.io/)\n",
    "1. Spell-check using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html)\n",
    "1. Subjectivity analysis using [TextBlob](https://textblob.readthedocs.io/en/dev/index.html)\n",
    "1. Text summarization using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n",
    "1. Sentiment analysis using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n",
    "1. Named entity recognition using [transformers](https://huggingface.co/docs/transformers/index) pipelines\n",
    "1. Topic analysis using [BERTopic](https://github.com/MaartenGr/BERTopic)\n",
    "\n",
    "The first cell is only run on Google Colab and installs the [ammico](https://github.com/ssciwr/AMMICO) package.\n",
    "\n",
    "After that, we can import `ammico` and read in the files given a folder path."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f43f327c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-28T07:00:17.237869Z",
     "iopub.status.busy": "2023-06-28T07:00:17.237649Z",
     "iopub.status.idle": "2023-06-28T07:00:17.245987Z",
     "shell.execute_reply": "2023-06-28T07:00:17.245431Z"
    }
   },
   "outputs": [],
   "source": [
    "# if running on google colab\n",
    "# flake8-noqa-cell\n",
    "import os\n",
    "\n",
    "if \"google.colab\" in str(get_ipython()):\n",
    "    # update python version\n",
    "    # install setuptools\n",
    "    # %pip install setuptools==61 -qqq\n",
    "    # install ammico\n",
    "    %pip install git+https://github.com/ssciwr/ammico.git -qqq\n",
    "    # mount google drive for data and API key\n",
    "    from google.colab import drive\n",
    "\n",
    "    drive.mount(\"/content/drive\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cf362e60",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-28T07:00:17.248550Z",
     "iopub.status.busy": "2023-06-28T07:00:17.248348Z",
     "iopub.status.idle": "2023-06-28T07:00:30.932913Z",
     "shell.execute_reply": "2023-06-28T07:00:30.932263Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import ammico\n",
    "from ammico import utils as mutils\n",
    "from ammico import display as mdisplay"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fddba721",
   "metadata": {},
   "source": [
    "We select a subset of image files to try the text extraction on; the `limit` keyword caps the number of files. The `find_files` function finds image files within a given directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "27675810",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-28T07:00:30.936544Z",
     "iopub.status.busy": "2023-06-28T07:00:30.935844Z",
     "iopub.status.idle": "2023-06-28T07:00:30.940575Z",
     "shell.execute_reply": "2023-06-28T07:00:30.939991Z"
    }
   },
   "outputs": [],
   "source": [
    "# Here you need to provide the path to your google drive folder\n",
    "# or local folder containing the images\n",
    "images = mutils.find_files(\n",
    "    path=\"data/\",\n",
    "    limit=10,\n",
    ")"
   ]
  },
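  {
   "cell_type": "markdown",
   "id": "e5a1c0f2",
   "metadata": {},
   "source": [
    "As a rough illustration of what `limit` does, the sketch below caps a recursive file search at a fixed number of results. The helper `find_image_files` and its `pattern` parameter are hypothetical and are not part of `ammico`; the actual `find_files` implementation may differ.\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "# Hypothetical helper for illustration only, not the ammico implementation.\n",
    "def find_image_files(path, pattern=\"*.png\", limit=10):\n",
    "    # Collect matching files below `path` in a stable (sorted) order.\n",
    "    files = sorted(str(p) for p in Path(path).rglob(pattern))\n",
    "    # Return all files if no limit is given, else only the first `limit`.\n",
    "    return files if limit is None else files[:limit]\n",
    "```"
   ]
  },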
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3a7dfe11",
   "metadata": {},
   "source": [
    "We need to initialize the main dictionary that contains all information for the images and is updated through each subsequent analysis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8b32409f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-28T07:00:30.943714Z",
     "iopub.status.busy": "2023-06-28T07:00:30.943149Z",
     "iopub.status.idle": "2023-06-28T07:00:30.946327Z",
     "shell.execute_reply": "2023-06-28T07:00:30.945738Z"
    }
   },
   "outputs": [],
   "source": [
    "mydict = mutils.initialize_dict(images)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b8b929f",
   "metadata": {},
   "source": [
    "## Google Cloud Vision API\n",
    "\n",
    "For this, you need an API key, and the app must be activated in your Google console. The first 1000 images per month are free (as of July 2022)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cbf74c0b-52fe-4fb8-b617-f18611e8f986",
   "metadata": {},
   "source": [
    "```\n",
    "os.environ[\n",
    "    \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
    "] = \"your-credentials.json\"\n",
    "```"
   ]
  },
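  {
   "cell_type": "markdown",
   "id": "b7d2e4a9",
   "metadata": {},
   "source": [
    "Before running the analysis, it can help to check that the variable actually points to an existing file. A minimal sketch, assuming the credentials are stored as a local key file; the message text is illustrative only:\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "# Read the path that the Google client libraries will look up.\n",
    "cred_path = os.environ.get(\"GOOGLE_APPLICATION_CREDENTIALS\")\n",
    "\n",
    "# Warn early if the variable is unset or the file is missing.\n",
    "if not cred_path or not os.path.isfile(cred_path):\n",
    "    print(\"GOOGLE_APPLICATION_CREDENTIALS is not set or does not point to a file.\")\n",
    "```"
   ]
  },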
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0891b795-c7fe-454c-a45d-45fadf788142",
   "metadata": {},
   "source": [
    "## Inspect the elements per image\n",
    "To check the analysis, you can inspect the analyzed elements here. Loading the results takes a moment, so please be patient. If you are sure of what you are doing, you can skip this and directly export a csv file in the step below.\n",
    "Here, we display the text extraction and translation results provided by the above libraries. Click on the tabs to see the results in the right sidebar. You may need to increment the `port` number if you are already running several notebook instances on the same server."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "7c6ecc88",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-28T07:00:30.949212Z",
     "iopub.status.busy": "2023-06-28T07:00:30.948876Z",
     "iopub.status.idle": "2023-06-28T07:00:31.715521Z",
     "shell.execute_reply": "2023-06-28T07:00:31.714832Z"
    }
   },
   "outputs": [],
   "source": [
    "# `AnalysisExplorer` does not accept an `identify` keyword (it raised a TypeError),\n",
    "# so only the dictionary is passed here.\n",
    "analysis_explorer = mdisplay.AnalysisExplorer(mydict)\n",
    "analysis_explorer.run_server(port=8054)"
   ]
  },
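  {
   "cell_type": "markdown",
   "id": "c3f9a1d6",
   "metadata": {},
   "source": [
    "If the chosen port is already in use, one option is to probe for a free one before starting the server. A minimal sketch using only the standard library; `find_free_port` is a hypothetical helper and not part of `ammico`:\n",
    "\n",
    "```python\n",
    "import socket\n",
    "\n",
    "\n",
    "# Hypothetical helper for illustration only, not part of ammico.\n",
    "def find_free_port(start=8054, tries=20):\n",
    "    for port in range(start, start + tries):\n",
    "        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:\n",
    "            # A non-zero result means nothing accepted a connection on this port.\n",
    "            if sock.connect_ex((\"127.0.0.1\", port)) != 0:\n",
    "                return port\n",
    "    raise RuntimeError(\"No free port found in the probed range.\")\n",
    "```\n",
    "\n",
    "The returned value could then be passed on, for example `run_server(port=find_free_port())`."
   ]
  },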
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9c3e72b5-0e57-4019-b45e-3e36a74e7f52",
   "metadata": {},
   "source": [
    "## Or directly analyze for further processing\n",
    "Instead of inspecting each of the images, you can also carry out the analysis directly and export the results to a CSV file. This may take a while depending on how many images you have loaded. Set the keyword `analyse_text` to `True` if you want the text to be analyzed (spell check, subjectivity, text summary, sentiment, NER)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "365c78b1-7ff4-4213-86fa-6a0a2d05198f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-28T07:00:31.719454Z",
     "iopub.status.busy": "2023-06-28T07:00:31.719096Z",
     "iopub.status.idle": "2023-06-28T07:03:16.703670Z",
     "shell.execute_reply": "2023-06-28T07:03:16.702945Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Downloading (…)/a4f8f3e/config.json: 100%|██████████| 1.80k/1.80k [00:00<00:00, 743kB/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Downloading pytorch_model.bin: 100%|██████████| 1.22G/1.22G [01:04<00:00, 18.9MB/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Downloading (…)okenizer_config.json: 100%|██████████| 26.0/26.0 [00:00<00:00, 22.4kB/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Downloading (…)e/a4f8f3e/vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 4.44MB/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Downloading (…)e/a4f8f3e/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 3.87MB/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Downloading (…)/af0f99b/config.json: 100%|██████████| 629/629 [00:00<00:00, 635kB/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading pytorch_model.bin: 100%|██████████| 268M/268M [00:08<00:00, 33.3MB/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading (…)okenizer_config.json: 0%| | 0.00/48.0 [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading (…)okenizer_config.json: 100%|██████████| 48.0/48.0 [00:00<00:00, 46.0kB/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading (…)ve/af0f99b/vocab.txt: 0%| | 0.00/232k [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading (…)ve/af0f99b/vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 11.2MB/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading (…)/f2482bf/config.json: 0%| | 0.00/998 [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading (…)/f2482bf/config.json: 100%|██████████| 998/998 [00:00<00:00, 448kB/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading pytorch_model.bin: 100%|██████████| 1.33G/1.33G [00:44<00:00, 29.8MB/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading (…)okenizer_config.json: 0%| | 0.00/60.0 [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading (…)okenizer_config.json: 100%|██████████| 60.0/60.0 [00:00<00:00, 58.7kB/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading (…)ve/f2482bf/vocab.txt: 0%| | 0.00/213k [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\r",
|
|
"Downloading (…)ve/f2482bf/vocab.txt: 100%|██████████| 213k/213k [00:00<00:00, 14.0MB/s]"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"for key in mydict:\n",
|
|
" mydict[key] = ammico.text.TextDetector(\n",
|
|
" mydict[key], analyse_text=True\n",
|
|
" ).analyse_image()"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "3c063eda",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Convert to dataframe and write csv\n",
|
|
"These steps convert the dictionary of dictionaries into a dictionary of lists, which can then be converted into a pandas dataframe and exported to a CSV file."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "5709c2cd",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-06-28T07:03:16.707948Z",
|
|
"iopub.status.busy": "2023-06-28T07:03:16.707711Z",
|
|
"iopub.status.idle": "2023-06-28T07:03:16.715239Z",
|
|
"shell.execute_reply": "2023-06-28T07:03:16.714662Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"outdict = mutils.append_data_to_dict(mydict)\n",
|
|
"df = mutils.dump_df(outdict)"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "ae182eb7",
|
|
"metadata": {},
|
|
"source": [
|
|
"Check the dataframe:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "c4f05637",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-06-28T07:03:16.717936Z",
|
|
"iopub.status.busy": "2023-06-28T07:03:16.717516Z",
|
|
"iopub.status.idle": "2023-06-28T07:03:16.737240Z",
|
|
"shell.execute_reply": "2023-06-28T07:03:16.736176Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>filename</th>\n",
|
|
" <th>text</th>\n",
|
|
" <th>text_language</th>\n",
|
|
" <th>text_english</th>\n",
|
|
" <th>text_summary</th>\n",
|
|
" <th>sentiment</th>\n",
|
|
" <th>sentiment_score</th>\n",
|
|
" <th>entity</th>\n",
|
|
" <th>entity_type</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>data/106349S_por.png</td>\n",
|
|
" <td>NEWS URGENTE SAMSUNG AO VIVO Rio de Janeiro NO...</td>\n",
|
|
" <td>pt</td>\n",
|
|
" <td>NEWS URGENT SAMSUNG LIVE Rio de Janeiro NEW CO...</td>\n",
|
|
" <td>NEW COUNTING METHOD RJ City HALL EXCLUDES 1,1...</td>\n",
|
|
" <td>NEGATIVE</td>\n",
|
|
" <td>0.99</td>\n",
|
|
" <td>[Rio de Janeiro, C, ##IT, P, ##NA, ##LTO]</td>\n",
|
|
" <td>[LOC, ORG, LOC, LOC, ORG, LOC]</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>data/102141_2_eng.png</td>\n",
|
|
" <td>CORONAVIRUS QUARANTINE CORONAVIRUS OUTBREAK BE...</td>\n",
|
|
" <td>en</td>\n",
|
|
" <td>CORONAVIRUS QUARANTINE CORONAVIRUS OUTBREAK BE...</td>\n",
|
|
" <td>Coronavirus QUARANTINE CORONAVIRUS OUTBREAK</td>\n",
|
|
" <td>NEGATIVE</td>\n",
|
|
" <td>0.98</td>\n",
|
|
" <td>[CORONAVIRUS, ##AR, ##TI, ##RONAVIR, ##C, Co]</td>\n",
|
|
" <td>[ORG, MISC, MISC, ORG, MISC, MISC]</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>data/102730_eng.png</td>\n",
|
|
" <td>400 DEATHS GET E-BOOK X AN Corporation ncy Ser...</td>\n",
|
|
" <td>en</td>\n",
|
|
" <td>400 DEATHS GET E-BOOK X AN Corporation ncy Ser...</td>\n",
|
|
" <td>A municipal worker sprays disinfectant on his...</td>\n",
|
|
" <td>NEGATIVE</td>\n",
|
|
" <td>0.99</td>\n",
|
|
" <td>[AN Corporation ncy Services, Ahmedabad, RE, #...</td>\n",
|
|
" <td>[ORG, LOC, PER, ORG]</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" filename text \\\n",
|
|
"0 data/106349S_por.png NEWS URGENTE SAMSUNG AO VIVO Rio de Janeiro NO... \n",
|
|
"1 data/102141_2_eng.png CORONAVIRUS QUARANTINE CORONAVIRUS OUTBREAK BE... \n",
|
|
"2 data/102730_eng.png 400 DEATHS GET E-BOOK X AN Corporation ncy Ser... \n",
|
|
"\n",
|
|
" text_language text_english \\\n",
|
|
"0 pt NEWS URGENT SAMSUNG LIVE Rio de Janeiro NEW CO... \n",
|
|
"1 en CORONAVIRUS QUARANTINE CORONAVIRUS OUTBREAK BE... \n",
|
|
"2 en 400 DEATHS GET E-BOOK X AN Corporation ncy Ser... \n",
|
|
"\n",
|
|
" text_summary sentiment \\\n",
|
|
"0 NEW COUNTING METHOD RJ City HALL EXCLUDES 1,1... NEGATIVE \n",
|
|
"1 Coronavirus QUARANTINE CORONAVIRUS OUTBREAK NEGATIVE \n",
|
|
"2 A municipal worker sprays disinfectant on his... NEGATIVE \n",
|
|
"\n",
|
|
" sentiment_score entity \\\n",
|
|
"0 0.99 [Rio de Janeiro, C, ##IT, P, ##NA, ##LTO] \n",
|
|
"1 0.98 [CORONAVIRUS, ##AR, ##TI, ##RONAVIR, ##C, Co] \n",
|
|
"2 0.99 [AN Corporation ncy Services, Ahmedabad, RE, #... \n",
|
|
"\n",
|
|
" entity_type \n",
|
|
"0 [LOC, ORG, LOC, LOC, ORG, LOC] \n",
|
|
"1 [ORG, MISC, MISC, ORG, MISC, MISC] \n",
|
|
"2 [ORG, LOC, PER, ORG] "
|
|
]
|
|
},
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "eedf1e47",
|
|
"metadata": {},
|
|
"source": [
|
|
"Write the csv file. Here you should provide the file path and file name of the csv file to be written."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"id": "bf6c9ddb",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-06-28T07:03:16.739813Z",
|
|
"iopub.status.busy": "2023-06-28T07:03:16.739473Z",
|
|
"iopub.status.idle": "2023-06-28T07:03:16.745547Z",
|
|
"shell.execute_reply": "2023-06-28T07:03:16.744988Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Write the csv\n",
|
|
"df.to_csv(\"./data_out.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "4bc8ac0a",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Topic analysis\n",
|
|
"The topic analysis is carried out with [BERTopic](https://maartengr.github.io/BERTopic/index.html), using an embedding model provided through a [spaCy](https://spacy.io/) pipeline."
|
|
]
|
|
},
|
|
{
|
|
"attachments": {},
|
|
"cell_type": "markdown",
|
|
"id": "4931941b",
|
|
"metadata": {},
|
|
"source": [
|
|
"BERTopic takes a list of strings as input. The more items the list contains, the better the topic modeling works. If the cell below raises an error in `analyse_topic()`, for example a `TypeError` from the dimensionality reduction step, the likely reason is that your dataset is too small.\n",
|
|
"\n",
|
|
"You can choose which dataframe entry to analyze. The default is `text_english`, but you can also select, for example, `text_summary` or `text_english_correct` by setting the keyword `analyze_text`:\n",
|
|
"\n",
|
|
"`ammico.text.PostprocessText(mydict=mydict, analyze_text=\"text_summary\").analyse_topic()`\n",
|
|
"\n",
|
|
"### Option 1: Use the dictionary as obtained from the above analysis."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"id": "a3450a61",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2023-06-28T07:03:16.748523Z",
|
|
"iopub.status.busy": "2023-06-28T07:03:16.748179Z",
|
|
"iopub.status.idle": "2023-06-28T07:03:29.951452Z",
|
|
"shell.execute_reply": "2023-06-28T07:03:29.950402Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Reading data from dict.\n",
|
|
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
|
|
"To disable this warning, you can either:\n",
|
|
"\t- Avoid using `tokenizers` before the fork if possible\n",
|
|
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Collecting en-core-web-md==3.5.0\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Requirement already satisfied: spacy<3.6.0,>=3.5.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from en-core-web-md==3.5.0) (3.5.3)\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (3.0.12)\n",
|
|
"Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (1.0.4)\n",
|
|
"Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (1.0.9)\n",
|
|
"Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (2.0.7)\n",
|
|
"Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (3.0.8)\n",
|
|
"Requirement already satisfied: thinc<8.2.0,>=8.1.8 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (8.1.10)\n",
|
|
"Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (1.1.2)\n",
|
|
"Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (2.4.6)\n",
|
|
"Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (2.0.8)\n",
|
|
"Requirement already satisfied: typer<0.8.0,>=0.3.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (0.7.0)\n",
|
|
"Requirement already satisfied: pathy>=0.10.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (0.10.2)\n",
|
|
"Requirement already satisfied: smart-open<7.0.0,>=5.2.1 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (6.3.0)\n",
|
|
"Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (4.65.0)\n",
|
|
"Requirement already satisfied: numpy>=1.15.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (1.23.4)\n",
|
|
"Requirement already satisfied: requests<3.0.0,>=2.13.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (2.31.0)\n",
|
|
"Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (1.10.9)\n",
|
|
"Requirement already satisfied: jinja2 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (3.1.2)\n",
|
|
"Requirement already satisfied: setuptools in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (58.1.0)\n",
|
|
"Requirement already satisfied: packaging>=20.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (23.1)\n",
|
|
"Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (3.3.0)\n",
|
|
"Requirement already satisfied: typing-extensions>=4.2.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4->spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (4.6.3)\n",
|
|
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (3.1.0)\n",
|
|
"Requirement already satisfied: idna<4,>=2.5 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (2.10)\n",
|
|
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (1.26.16)\n",
|
|
"Requirement already satisfied: certifi>=2017.4.17 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (2023.5.7)\n",
|
|
"Requirement already satisfied: blis<0.8.0,>=0.7.8 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from thinc<8.2.0,>=8.1.8->spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (0.7.9)\n",
|
|
"Requirement already satisfied: confection<1.0.0,>=0.0.1 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from thinc<8.2.0,>=8.1.8->spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (0.0.4)\n",
|
|
"Requirement already satisfied: click<9.0.0,>=7.1.1 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from typer<0.8.0,>=0.3.0->spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (8.1.3)\n",
|
|
"Requirement already satisfied: MarkupSafe>=2.0 in /opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages (from jinja2->spacy<3.6.0,>=3.5.0->en-core-web-md==3.5.0) (2.1.3)\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Installing collected packages: en-core-web-md\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Successfully installed en-core-web-md-3.5.0\n",
|
|
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
|
|
"You can now load the package via spacy.load('en_core_web_md')\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n",
|
|
"[notice] A new release of pip is available: 23.0.1 -> 23.1.2\n",
|
|
"[notice] To update, run: pip install --upgrade pip\n"
|
|
]
|
|
},
|
|
{
|
|
"ename": "TypeError",
|
|
"evalue": "Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/bertopic/_bertopic.py:2868\u001b[0m, in \u001b[0;36mBERTopic._reduce_dimensionality\u001b[0;34m(self, embeddings, y, partial_fit)\u001b[0m\n\u001b[1;32m 2867\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m-> 2868\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mumap_model\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfit\u001b[49m\u001b[43m(\u001b[49m\u001b[43membeddings\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43my\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2869\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2684\u001b[0m, in \u001b[0;36mUMAP.fit\u001b[0;34m(self, X, y)\u001b[0m\n\u001b[1;32m 2683\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtransform_mode \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124membedding\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[0;32m-> 2684\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39membedding_, aux_data \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_fit_embed_data\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2685\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_raw_data\u001b[49m\u001b[43m[\u001b[49m\u001b[43mindex\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2686\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mn_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2687\u001b[0m \u001b[43m \u001b[49m\u001b[43minit\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2688\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# JH why raw data?\u001b[39;49;00m\n\u001b[1;32m 2689\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2690\u001b[0m \u001b[38;5;66;03m# Assign any points that are fully disconnected from our manifold(s) to have embedding\u001b[39;00m\n\u001b[1;32m 2691\u001b[0m \u001b[38;5;66;03m# coordinates of np.nan. 
These will be filtered by our plotting functions automatically.\u001b[39;00m\n\u001b[1;32m 2692\u001b[0m \u001b[38;5;66;03m# They also prevent users from being deceived a distance query to one of these points.\u001b[39;00m\n\u001b[1;32m 2693\u001b[0m \u001b[38;5;66;03m# Might be worth moving this into simplicial_set_embedding or _fit_embed_data\u001b[39;00m\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2717\u001b[0m, in \u001b[0;36mUMAP._fit_embed_data\u001b[0;34m(self, X, n_epochs, init, random_state)\u001b[0m\n\u001b[1;32m 2714\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"A method wrapper for simplicial_set_embedding that can be\u001b[39;00m\n\u001b[1;32m 2715\u001b[0m \u001b[38;5;124;03mreplaced by subclasses.\u001b[39;00m\n\u001b[1;32m 2716\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m-> 2717\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43msimplicial_set_embedding\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2718\u001b[0m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2719\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgraph_\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2720\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mn_components\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2721\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_initial_alpha\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2722\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_a\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2723\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_b\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2724\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrepulsion_strength\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2725\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mnegative_sample_rate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2726\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mn_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2727\u001b[0m \u001b[43m \u001b[49m\u001b[43minit\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2728\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2729\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_input_distance_func\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2730\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_metric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2731\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdensmap\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2732\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_densmap_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2733\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43moutput_dens\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2734\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_output_distance_func\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2735\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_output_metric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2736\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43moutput_metric\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43meuclidean\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43ml2\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2737\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrandom_state\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mis\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 2738\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mverbose\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2739\u001b[0m \u001b[43m \u001b[49m\u001b[43mtqdm_kwds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtqdm_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2740\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:1078\u001b[0m, in \u001b[0;36msimplicial_set_embedding\u001b[0;34m(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)\u001b[0m\n\u001b[1;32m 1076\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(init, \u001b[38;5;28mstr\u001b[39m) \u001b[38;5;129;01mand\u001b[39;00m init \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mspectral\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[1;32m 1077\u001b[0m \u001b[38;5;66;03m# We add a little noise to avoid local minima for optimization to come\u001b[39;00m\n\u001b[0;32m-> 1078\u001b[0m initialisation \u001b[38;5;241m=\u001b[39m \u001b[43mspectral_layout\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1079\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1080\u001b[0m \u001b[43m \u001b[49m\u001b[43mgraph\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1081\u001b[0m \u001b[43m \u001b[49m\u001b[43mn_components\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1082\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1083\u001b[0m \u001b[43m \u001b[49m\u001b[43mmetric\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmetric\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1084\u001b[0m \u001b[43m \u001b[49m\u001b[43mmetric_kwds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmetric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1085\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1086\u001b[0m expansion \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m10.0\u001b[39m \u001b[38;5;241m/\u001b[39m np\u001b[38;5;241m.\u001b[39mabs(initialisation)\u001b[38;5;241m.\u001b[39mmax()\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/spectral.py:332\u001b[0m, in \u001b[0;36mspectral_layout\u001b[0;34m(data, graph, dim, random_state, metric, metric_kwds)\u001b[0m\n\u001b[1;32m 331\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m L\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m] \u001b[38;5;241m<\u001b[39m \u001b[38;5;241m2000000\u001b[39m:\n\u001b[0;32m--> 332\u001b[0m eigenvalues, eigenvectors \u001b[38;5;241m=\u001b[39m \u001b[43mscipy\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msparse\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlinalg\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43meigsh\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 333\u001b[0m \u001b[43m \u001b[49m\u001b[43mL\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 334\u001b[0m \u001b[43m \u001b[49m\u001b[43mk\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 335\u001b[0m \u001b[43m \u001b[49m\u001b[43mwhich\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mSM\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 336\u001b[0m \u001b[43m \u001b[49m\u001b[43mncv\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mnum_lanczos_vectors\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 337\u001b[0m \u001b[43m \u001b[49m\u001b[43mtol\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m1e-4\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 338\u001b[0m \u001b[43m \u001b[49m\u001b[43mv0\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mones\u001b[49m\u001b[43m(\u001b[49m\u001b[43mL\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mshape\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 339\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mmaxiter\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgraph\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mshape\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;241;43m5\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 340\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 341\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1605\u001b[0m, in \u001b[0;36meigsh\u001b[0;34m(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)\u001b[0m\n\u001b[1;32m 1604\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m issparse(A):\n\u001b[0;32m-> 1605\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCannot use scipy.linalg.eigh for sparse A with \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1606\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mk >= N. Use scipy.linalg.eigh(A.toarray()) or\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1607\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m reduce k.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 1608\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(A, LinearOperator):\n",
|
|
"\u001b[0;31mTypeError\u001b[0m: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.",
|
|
"\nDuring handling of the above exception, another exception occurred:\n",
|
|
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[10], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# make a list of all the text_english entries per analysed image from the mydict variable as above\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m topic_model, topic_df, most_frequent_topics \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPostprocessText\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mmydict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmydict\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43manalyse_topic\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
|
|
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:221\u001b[0m, in \u001b[0;36mPostprocessText.analyse_topic\u001b[0;34m(self, return_topics)\u001b[0m\n\u001b[1;32m 219\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[1;32m 220\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mBERTopic excited with an error - maybe your dataset is too small?\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m--> 221\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtopics, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mprobs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtopic_model\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfit_transform\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlist_text_english\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 222\u001b[0m \u001b[38;5;66;03m# return the topic list\u001b[39;00m\n\u001b[1;32m 223\u001b[0m topic_df \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtopic_model\u001b[38;5;241m.\u001b[39mget_topic_info()\n",
|
|
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/bertopic/_bertopic.py:356\u001b[0m, in \u001b[0;36mBERTopic.fit_transform\u001b[0;34m(self, documents, embeddings, y)\u001b[0m\n\u001b[1;32m 354\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mseed_topic_list \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39membedding_model \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 355\u001b[0m y, embeddings \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_guided_topic_modeling(embeddings)\n\u001b[0;32m--> 356\u001b[0m umap_embeddings \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_reduce_dimensionality\u001b[49m\u001b[43m(\u001b[49m\u001b[43membeddings\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 358\u001b[0m \u001b[38;5;66;03m# Cluster reduced embeddings\u001b[39;00m\n\u001b[1;32m 359\u001b[0m documents, probabilities \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_cluster_embeddings(umap_embeddings, documents, y\u001b[38;5;241m=\u001b[39my)\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/bertopic/_bertopic.py:2872\u001b[0m, in \u001b[0;36mBERTopic._reduce_dimensionality\u001b[0;34m(self, embeddings, y, partial_fit)\u001b[0m\n\u001b[1;32m 2869\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[1;32m 2870\u001b[0m logger\u001b[38;5;241m.\u001b[39minfo(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mThe dimensionality reduction algorithm did not contain the `y` parameter and\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 2871\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m therefore the `y` parameter was not used\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m-> 2872\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mumap_model\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfit\u001b[49m\u001b[43m(\u001b[49m\u001b[43membeddings\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2874\u001b[0m umap_embeddings \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mumap_model\u001b[38;5;241m.\u001b[39mtransform(embeddings)\n\u001b[1;32m 2875\u001b[0m logger\u001b[38;5;241m.\u001b[39minfo(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReduced dimensionality\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2684\u001b[0m, in \u001b[0;36mUMAP.fit\u001b[0;34m(self, X, y)\u001b[0m\n\u001b[1;32m 2681\u001b[0m \u001b[38;5;28mprint\u001b[39m(ts(), \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mConstruct embedding\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 2683\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtransform_mode \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124membedding\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[0;32m-> 2684\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39membedding_, aux_data \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_fit_embed_data\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2685\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_raw_data\u001b[49m\u001b[43m[\u001b[49m\u001b[43mindex\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2686\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mn_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2687\u001b[0m \u001b[43m \u001b[49m\u001b[43minit\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2688\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# JH why raw data?\u001b[39;49;00m\n\u001b[1;32m 2689\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2690\u001b[0m \u001b[38;5;66;03m# Assign any points that are fully disconnected from our manifold(s) to have embedding\u001b[39;00m\n\u001b[1;32m 2691\u001b[0m \u001b[38;5;66;03m# coordinates of np.nan. 
These will be filtered by our plotting functions automatically.\u001b[39;00m\n\u001b[1;32m 2692\u001b[0m \u001b[38;5;66;03m# They also prevent users from being deceived a distance query to one of these points.\u001b[39;00m\n\u001b[1;32m 2693\u001b[0m \u001b[38;5;66;03m# Might be worth moving this into simplicial_set_embedding or _fit_embed_data\u001b[39;00m\n\u001b[1;32m 2694\u001b[0m disconnected_vertices \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39marray(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mgraph_\u001b[38;5;241m.\u001b[39msum(axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1\u001b[39m))\u001b[38;5;241m.\u001b[39mflatten() \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m0\u001b[39m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2717\u001b[0m, in \u001b[0;36mUMAP._fit_embed_data\u001b[0;34m(self, X, n_epochs, init, random_state)\u001b[0m\n\u001b[1;32m 2713\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_fit_embed_data\u001b[39m(\u001b[38;5;28mself\u001b[39m, X, n_epochs, init, random_state):\n\u001b[1;32m 2714\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"A method wrapper for simplicial_set_embedding that can be\u001b[39;00m\n\u001b[1;32m 2715\u001b[0m \u001b[38;5;124;03m replaced by subclasses.\u001b[39;00m\n\u001b[1;32m 2716\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m-> 2717\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43msimplicial_set_embedding\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2718\u001b[0m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2719\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgraph_\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2720\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mn_components\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2721\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_initial_alpha\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2722\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_a\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2723\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_b\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2724\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrepulsion_strength\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2725\u001b[0m \u001b[43m 
\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mnegative_sample_rate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2726\u001b[0m \u001b[43m \u001b[49m\u001b[43mn_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2727\u001b[0m \u001b[43m \u001b[49m\u001b[43minit\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2728\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2729\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_input_distance_func\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2730\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_metric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2731\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdensmap\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2732\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_densmap_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2733\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43moutput_dens\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2734\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_output_distance_func\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2735\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_output_metric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2736\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43moutput_metric\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m 
\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43meuclidean\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43ml2\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2737\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrandom_state\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mis\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 2738\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mverbose\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2739\u001b[0m \u001b[43m \u001b[49m\u001b[43mtqdm_kwds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtqdm_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2740\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:1078\u001b[0m, in \u001b[0;36msimplicial_set_embedding\u001b[0;34m(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)\u001b[0m\n\u001b[1;32m 1073\u001b[0m embedding \u001b[38;5;241m=\u001b[39m random_state\u001b[38;5;241m.\u001b[39muniform(\n\u001b[1;32m 1074\u001b[0m low\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m-\u001b[39m\u001b[38;5;241m10.0\u001b[39m, high\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m10.0\u001b[39m, size\u001b[38;5;241m=\u001b[39m(graph\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m], n_components)\n\u001b[1;32m 1075\u001b[0m )\u001b[38;5;241m.\u001b[39mastype(np\u001b[38;5;241m.\u001b[39mfloat32)\n\u001b[1;32m 1076\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(init, \u001b[38;5;28mstr\u001b[39m) \u001b[38;5;129;01mand\u001b[39;00m init \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mspectral\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[1;32m 1077\u001b[0m \u001b[38;5;66;03m# We add a little noise to avoid local minima for optimization to come\u001b[39;00m\n\u001b[0;32m-> 1078\u001b[0m initialisation \u001b[38;5;241m=\u001b[39m \u001b[43mspectral_layout\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1079\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1080\u001b[0m \u001b[43m \u001b[49m\u001b[43mgraph\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1081\u001b[0m \u001b[43m \u001b[49m\u001b[43mn_components\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1082\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1083\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mmetric\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmetric\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1084\u001b[0m \u001b[43m \u001b[49m\u001b[43mmetric_kwds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmetric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1085\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1086\u001b[0m expansion \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m10.0\u001b[39m \u001b[38;5;241m/\u001b[39m np\u001b[38;5;241m.\u001b[39mabs(initialisation)\u001b[38;5;241m.\u001b[39mmax()\n\u001b[1;32m 1087\u001b[0m embedding \u001b[38;5;241m=\u001b[39m (initialisation \u001b[38;5;241m*\u001b[39m expansion)\u001b[38;5;241m.\u001b[39mastype(\n\u001b[1;32m 1088\u001b[0m np\u001b[38;5;241m.\u001b[39mfloat32\n\u001b[1;32m 1089\u001b[0m ) \u001b[38;5;241m+\u001b[39m random_state\u001b[38;5;241m.\u001b[39mnormal(\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1092\u001b[0m np\u001b[38;5;241m.\u001b[39mfloat32\n\u001b[1;32m 1093\u001b[0m )\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/spectral.py:332\u001b[0m, in \u001b[0;36mspectral_layout\u001b[0;34m(data, graph, dim, random_state, metric, metric_kwds)\u001b[0m\n\u001b[1;32m 330\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 331\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m L\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m] \u001b[38;5;241m<\u001b[39m \u001b[38;5;241m2000000\u001b[39m:\n\u001b[0;32m--> 332\u001b[0m eigenvalues, eigenvectors \u001b[38;5;241m=\u001b[39m \u001b[43mscipy\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msparse\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlinalg\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43meigsh\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 333\u001b[0m \u001b[43m \u001b[49m\u001b[43mL\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 334\u001b[0m \u001b[43m \u001b[49m\u001b[43mk\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 335\u001b[0m \u001b[43m \u001b[49m\u001b[43mwhich\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mSM\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 336\u001b[0m \u001b[43m \u001b[49m\u001b[43mncv\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mnum_lanczos_vectors\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 337\u001b[0m \u001b[43m \u001b[49m\u001b[43mtol\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m1e-4\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 338\u001b[0m \u001b[43m \u001b[49m\u001b[43mv0\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mones\u001b[49m\u001b[43m(\u001b[49m\u001b[43mL\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mshape\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 339\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mmaxiter\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgraph\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mshape\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;241;43m5\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 340\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 341\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 342\u001b[0m eigenvalues, eigenvectors \u001b[38;5;241m=\u001b[39m scipy\u001b[38;5;241m.\u001b[39msparse\u001b[38;5;241m.\u001b[39mlinalg\u001b[38;5;241m.\u001b[39mlobpcg(\n\u001b[1;32m 343\u001b[0m L, random_state\u001b[38;5;241m.\u001b[39mnormal(size\u001b[38;5;241m=\u001b[39m(L\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m], k)), largest\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m, tol\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1e-8\u001b[39m\n\u001b[1;32m 344\u001b[0m )\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1605\u001b[0m, in \u001b[0;36meigsh\u001b[0;34m(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)\u001b[0m\n\u001b[1;32m 1600\u001b[0m warnings\u001b[38;5;241m.\u001b[39mwarn(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mk >= N for N * N square matrix. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1601\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAttempting to use scipy.linalg.eigh instead.\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 1602\u001b[0m \u001b[38;5;167;01mRuntimeWarning\u001b[39;00m)\n\u001b[1;32m 1604\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m issparse(A):\n\u001b[0;32m-> 1605\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCannot use scipy.linalg.eigh for sparse A with \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1606\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mk >= N. Use scipy.linalg.eigh(A.toarray()) or\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1607\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m reduce k.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 1608\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(A, LinearOperator):\n\u001b[1;32m 1609\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCannot use scipy.linalg.eigh for LinearOperator \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1610\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mA with k >= N.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"\u001b[0;31mTypeError\u001b[0m: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k."
]
}
],
"source": [
"# make a list of all the text_english entries per analysed image from the mydict variable as above\n",
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
" mydict=mydict\n",
").analyse_topic()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "95667342",
"metadata": {},
"source": [
"### Option 2: Read in a csv\n",
"Not to analyse too many images on google Cloud Vision, use the csv output to obtain the text (when rerunning already analysed images)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "5530e436",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-28T07:03:29.955500Z",
"iopub.status.busy": "2023-06-28T07:03:29.954960Z",
"iopub.status.idle": "2023-06-28T07:03:31.423205Z",
"shell.execute_reply": "2023-06-28T07:03:31.422255Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reading data from df.\n"
]
},
{
"ename": "TypeError",
"evalue": "Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/bertopic/_bertopic.py:2868\u001b[0m, in \u001b[0;36mBERTopic._reduce_dimensionality\u001b[0;34m(self, embeddings, y, partial_fit)\u001b[0m\n\u001b[1;32m 2867\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m-> 2868\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mumap_model\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfit\u001b[49m\u001b[43m(\u001b[49m\u001b[43membeddings\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43my\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2869\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2684\u001b[0m, in \u001b[0;36mUMAP.fit\u001b[0;34m(self, X, y)\u001b[0m\n\u001b[1;32m 2683\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtransform_mode \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124membedding\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[0;32m-> 2684\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39membedding_, aux_data \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_fit_embed_data\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2685\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_raw_data\u001b[49m\u001b[43m[\u001b[49m\u001b[43mindex\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2686\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mn_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2687\u001b[0m \u001b[43m \u001b[49m\u001b[43minit\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2688\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# JH why raw data?\u001b[39;49;00m\n\u001b[1;32m 2689\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2690\u001b[0m \u001b[38;5;66;03m# Assign any points that are fully disconnected from our manifold(s) to have embedding\u001b[39;00m\n\u001b[1;32m 2691\u001b[0m \u001b[38;5;66;03m# coordinates of np.nan. 
These will be filtered by our plotting functions automatically.\u001b[39;00m\n\u001b[1;32m 2692\u001b[0m \u001b[38;5;66;03m# They also prevent users from being deceived a distance query to one of these points.\u001b[39;00m\n\u001b[1;32m 2693\u001b[0m \u001b[38;5;66;03m# Might be worth moving this into simplicial_set_embedding or _fit_embed_data\u001b[39;00m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2717\u001b[0m, in \u001b[0;36mUMAP._fit_embed_data\u001b[0;34m(self, X, n_epochs, init, random_state)\u001b[0m\n\u001b[1;32m 2714\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"A method wrapper for simplicial_set_embedding that can be\u001b[39;00m\n\u001b[1;32m 2715\u001b[0m \u001b[38;5;124;03mreplaced by subclasses.\u001b[39;00m\n\u001b[1;32m 2716\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m-> 2717\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43msimplicial_set_embedding\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2718\u001b[0m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2719\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgraph_\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2720\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mn_components\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2721\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_initial_alpha\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2722\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_a\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2723\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_b\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2724\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrepulsion_strength\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2725\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mnegative_sample_rate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2726\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mn_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2727\u001b[0m \u001b[43m \u001b[49m\u001b[43minit\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2728\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2729\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_input_distance_func\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2730\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_metric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2731\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdensmap\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2732\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_densmap_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2733\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43moutput_dens\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2734\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_output_distance_func\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2735\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_output_metric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2736\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43moutput_metric\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43meuclidean\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43ml2\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2737\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrandom_state\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mis\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 2738\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mverbose\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2739\u001b[0m \u001b[43m \u001b[49m\u001b[43mtqdm_kwds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtqdm_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2740\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:1078\u001b[0m, in \u001b[0;36msimplicial_set_embedding\u001b[0;34m(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)\u001b[0m\n\u001b[1;32m 1076\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(init, \u001b[38;5;28mstr\u001b[39m) \u001b[38;5;129;01mand\u001b[39;00m init \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mspectral\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[1;32m 1077\u001b[0m \u001b[38;5;66;03m# We add a little noise to avoid local minima for optimization to come\u001b[39;00m\n\u001b[0;32m-> 1078\u001b[0m initialisation \u001b[38;5;241m=\u001b[39m \u001b[43mspectral_layout\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1079\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1080\u001b[0m \u001b[43m \u001b[49m\u001b[43mgraph\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1081\u001b[0m \u001b[43m \u001b[49m\u001b[43mn_components\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1082\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1083\u001b[0m \u001b[43m \u001b[49m\u001b[43mmetric\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmetric\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1084\u001b[0m \u001b[43m \u001b[49m\u001b[43mmetric_kwds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmetric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1085\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1086\u001b[0m expansion \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m10.0\u001b[39m \u001b[38;5;241m/\u001b[39m np\u001b[38;5;241m.\u001b[39mabs(initialisation)\u001b[38;5;241m.\u001b[39mmax()\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/spectral.py:332\u001b[0m, in \u001b[0;36mspectral_layout\u001b[0;34m(data, graph, dim, random_state, metric, metric_kwds)\u001b[0m\n\u001b[1;32m 331\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m L\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m] \u001b[38;5;241m<\u001b[39m \u001b[38;5;241m2000000\u001b[39m:\n\u001b[0;32m--> 332\u001b[0m eigenvalues, eigenvectors \u001b[38;5;241m=\u001b[39m \u001b[43mscipy\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msparse\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlinalg\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43meigsh\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 333\u001b[0m \u001b[43m \u001b[49m\u001b[43mL\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 334\u001b[0m \u001b[43m \u001b[49m\u001b[43mk\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 335\u001b[0m \u001b[43m \u001b[49m\u001b[43mwhich\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mSM\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 336\u001b[0m \u001b[43m \u001b[49m\u001b[43mncv\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mnum_lanczos_vectors\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 337\u001b[0m \u001b[43m \u001b[49m\u001b[43mtol\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m1e-4\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 338\u001b[0m \u001b[43m \u001b[49m\u001b[43mv0\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mones\u001b[49m\u001b[43m(\u001b[49m\u001b[43mL\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mshape\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 339\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mmaxiter\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgraph\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mshape\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;241;43m5\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 340\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 341\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1605\u001b[0m, in \u001b[0;36meigsh\u001b[0;34m(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)\u001b[0m\n\u001b[1;32m 1604\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m issparse(A):\n\u001b[0;32m-> 1605\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCannot use scipy.linalg.eigh for sparse A with \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1606\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mk >= N. Use scipy.linalg.eigh(A.toarray()) or\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1607\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m reduce k.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 1608\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(A, LinearOperator):\n",
"\u001b[0;31mTypeError\u001b[0m: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[11], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m input_file_path \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdata_out.csv\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m----> 2\u001b[0m topic_model, topic_df, most_frequent_topics \u001b[38;5;241m=\u001b[39m \u001b[43mammico\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPostprocessText\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_csv\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcsv_path\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minput_file_path\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43manalyse_topic\u001b[49m\u001b[43m(\u001b[49m\u001b[43mreturn_topics\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m10\u001b[39;49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/work/AMMICO/AMMICO/ammico/text.py:221\u001b[0m, in \u001b[0;36mPostprocessText.analyse_topic\u001b[0;34m(self, return_topics)\u001b[0m\n\u001b[1;32m 219\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[1;32m 220\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mBERTopic excited with an error - maybe your dataset is too small?\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m--> 221\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtopics, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mprobs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtopic_model\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfit_transform\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlist_text_english\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 222\u001b[0m \u001b[38;5;66;03m# return the topic list\u001b[39;00m\n\u001b[1;32m 223\u001b[0m topic_df \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtopic_model\u001b[38;5;241m.\u001b[39mget_topic_info()\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/bertopic/_bertopic.py:356\u001b[0m, in \u001b[0;36mBERTopic.fit_transform\u001b[0;34m(self, documents, embeddings, y)\u001b[0m\n\u001b[1;32m 354\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mseed_topic_list \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39membedding_model \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 355\u001b[0m y, embeddings \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_guided_topic_modeling(embeddings)\n\u001b[0;32m--> 356\u001b[0m umap_embeddings \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_reduce_dimensionality\u001b[49m\u001b[43m(\u001b[49m\u001b[43membeddings\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 358\u001b[0m \u001b[38;5;66;03m# Cluster reduced embeddings\u001b[39;00m\n\u001b[1;32m 359\u001b[0m documents, probabilities \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_cluster_embeddings(umap_embeddings, documents, y\u001b[38;5;241m=\u001b[39my)\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/bertopic/_bertopic.py:2872\u001b[0m, in \u001b[0;36mBERTopic._reduce_dimensionality\u001b[0;34m(self, embeddings, y, partial_fit)\u001b[0m\n\u001b[1;32m 2869\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[1;32m 2870\u001b[0m logger\u001b[38;5;241m.\u001b[39minfo(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mThe dimensionality reduction algorithm did not contain the `y` parameter and\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 2871\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m therefore the `y` parameter was not used\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m-> 2872\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mumap_model\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfit\u001b[49m\u001b[43m(\u001b[49m\u001b[43membeddings\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2874\u001b[0m umap_embeddings \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mumap_model\u001b[38;5;241m.\u001b[39mtransform(embeddings)\n\u001b[1;32m 2875\u001b[0m logger\u001b[38;5;241m.\u001b[39minfo(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mReduced dimensionality\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2684\u001b[0m, in \u001b[0;36mUMAP.fit\u001b[0;34m(self, X, y)\u001b[0m\n\u001b[1;32m 2681\u001b[0m \u001b[38;5;28mprint\u001b[39m(ts(), \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mConstruct embedding\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 2683\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtransform_mode \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124membedding\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[0;32m-> 2684\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39membedding_, aux_data \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_fit_embed_data\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2685\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_raw_data\u001b[49m\u001b[43m[\u001b[49m\u001b[43mindex\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2686\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mn_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2687\u001b[0m \u001b[43m \u001b[49m\u001b[43minit\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2688\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# JH why raw data?\u001b[39;49;00m\n\u001b[1;32m 2689\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2690\u001b[0m \u001b[38;5;66;03m# Assign any points that are fully disconnected from our manifold(s) to have embedding\u001b[39;00m\n\u001b[1;32m 2691\u001b[0m \u001b[38;5;66;03m# coordinates of np.nan. 
These will be filtered by our plotting functions automatically.\u001b[39;00m\n\u001b[1;32m 2692\u001b[0m \u001b[38;5;66;03m# They also prevent users from being deceived a distance query to one of these points.\u001b[39;00m\n\u001b[1;32m 2693\u001b[0m \u001b[38;5;66;03m# Might be worth moving this into simplicial_set_embedding or _fit_embed_data\u001b[39;00m\n\u001b[1;32m 2694\u001b[0m disconnected_vertices \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39marray(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mgraph_\u001b[38;5;241m.\u001b[39msum(axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1\u001b[39m))\u001b[38;5;241m.\u001b[39mflatten() \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m0\u001b[39m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:2717\u001b[0m, in \u001b[0;36mUMAP._fit_embed_data\u001b[0;34m(self, X, n_epochs, init, random_state)\u001b[0m\n\u001b[1;32m 2713\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_fit_embed_data\u001b[39m(\u001b[38;5;28mself\u001b[39m, X, n_epochs, init, random_state):\n\u001b[1;32m 2714\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"A method wrapper for simplicial_set_embedding that can be\u001b[39;00m\n\u001b[1;32m 2715\u001b[0m \u001b[38;5;124;03m replaced by subclasses.\u001b[39;00m\n\u001b[1;32m 2716\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m-> 2717\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43msimplicial_set_embedding\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2718\u001b[0m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2719\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgraph_\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2720\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mn_components\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2721\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_initial_alpha\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2722\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_a\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2723\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_b\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2724\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrepulsion_strength\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2725\u001b[0m \u001b[43m 
\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mnegative_sample_rate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2726\u001b[0m \u001b[43m \u001b[49m\u001b[43mn_epochs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2727\u001b[0m \u001b[43m \u001b[49m\u001b[43minit\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2728\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2729\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_input_distance_func\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2730\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_metric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2731\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdensmap\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2732\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_densmap_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2733\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43moutput_dens\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2734\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_output_distance_func\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2735\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_output_metric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2736\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43moutput_metric\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m 
\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43meuclidean\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43ml2\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2737\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrandom_state\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mis\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 2738\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mverbose\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2739\u001b[0m \u001b[43m \u001b[49m\u001b[43mtqdm_kwds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtqdm_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2740\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/umap_.py:1078\u001b[0m, in \u001b[0;36msimplicial_set_embedding\u001b[0;34m(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)\u001b[0m\n\u001b[1;32m 1073\u001b[0m embedding \u001b[38;5;241m=\u001b[39m random_state\u001b[38;5;241m.\u001b[39muniform(\n\u001b[1;32m 1074\u001b[0m low\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m-\u001b[39m\u001b[38;5;241m10.0\u001b[39m, high\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m10.0\u001b[39m, size\u001b[38;5;241m=\u001b[39m(graph\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m], n_components)\n\u001b[1;32m 1075\u001b[0m )\u001b[38;5;241m.\u001b[39mastype(np\u001b[38;5;241m.\u001b[39mfloat32)\n\u001b[1;32m 1076\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(init, \u001b[38;5;28mstr\u001b[39m) \u001b[38;5;129;01mand\u001b[39;00m init \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mspectral\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[1;32m 1077\u001b[0m \u001b[38;5;66;03m# We add a little noise to avoid local minima for optimization to come\u001b[39;00m\n\u001b[0;32m-> 1078\u001b[0m initialisation \u001b[38;5;241m=\u001b[39m \u001b[43mspectral_layout\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1079\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1080\u001b[0m \u001b[43m \u001b[49m\u001b[43mgraph\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1081\u001b[0m \u001b[43m \u001b[49m\u001b[43mn_components\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1082\u001b[0m \u001b[43m \u001b[49m\u001b[43mrandom_state\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1083\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mmetric\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmetric\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1084\u001b[0m \u001b[43m \u001b[49m\u001b[43mmetric_kwds\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmetric_kwds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1085\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1086\u001b[0m expansion \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m10.0\u001b[39m \u001b[38;5;241m/\u001b[39m np\u001b[38;5;241m.\u001b[39mabs(initialisation)\u001b[38;5;241m.\u001b[39mmax()\n\u001b[1;32m 1087\u001b[0m embedding \u001b[38;5;241m=\u001b[39m (initialisation \u001b[38;5;241m*\u001b[39m expansion)\u001b[38;5;241m.\u001b[39mastype(\n\u001b[1;32m 1088\u001b[0m np\u001b[38;5;241m.\u001b[39mfloat32\n\u001b[1;32m 1089\u001b[0m ) \u001b[38;5;241m+\u001b[39m random_state\u001b[38;5;241m.\u001b[39mnormal(\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1092\u001b[0m np\u001b[38;5;241m.\u001b[39mfloat32\n\u001b[1;32m 1093\u001b[0m )\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/umap/spectral.py:332\u001b[0m, in \u001b[0;36mspectral_layout\u001b[0;34m(data, graph, dim, random_state, metric, metric_kwds)\u001b[0m\n\u001b[1;32m 330\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 331\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m L\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m] \u001b[38;5;241m<\u001b[39m \u001b[38;5;241m2000000\u001b[39m:\n\u001b[0;32m--> 332\u001b[0m eigenvalues, eigenvectors \u001b[38;5;241m=\u001b[39m \u001b[43mscipy\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msparse\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlinalg\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43meigsh\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 333\u001b[0m \u001b[43m \u001b[49m\u001b[43mL\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 334\u001b[0m \u001b[43m \u001b[49m\u001b[43mk\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 335\u001b[0m \u001b[43m \u001b[49m\u001b[43mwhich\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mSM\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 336\u001b[0m \u001b[43m \u001b[49m\u001b[43mncv\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mnum_lanczos_vectors\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 337\u001b[0m \u001b[43m \u001b[49m\u001b[43mtol\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m1e-4\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 338\u001b[0m \u001b[43m \u001b[49m\u001b[43mv0\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mones\u001b[49m\u001b[43m(\u001b[49m\u001b[43mL\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mshape\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 339\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mmaxiter\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgraph\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mshape\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;241;43m5\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 340\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 341\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 342\u001b[0m eigenvalues, eigenvectors \u001b[38;5;241m=\u001b[39m scipy\u001b[38;5;241m.\u001b[39msparse\u001b[38;5;241m.\u001b[39mlinalg\u001b[38;5;241m.\u001b[39mlobpcg(\n\u001b[1;32m 343\u001b[0m L, random_state\u001b[38;5;241m.\u001b[39mnormal(size\u001b[38;5;241m=\u001b[39m(L\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m], k)), largest\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m, tol\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1e-8\u001b[39m\n\u001b[1;32m 344\u001b[0m )\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1605\u001b[0m, in \u001b[0;36meigsh\u001b[0;34m(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)\u001b[0m\n\u001b[1;32m 1600\u001b[0m warnings\u001b[38;5;241m.\u001b[39mwarn(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mk >= N for N * N square matrix. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1601\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAttempting to use scipy.linalg.eigh instead.\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 1602\u001b[0m \u001b[38;5;167;01mRuntimeWarning\u001b[39;00m)\n\u001b[1;32m 1604\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m issparse(A):\n\u001b[0;32m-> 1605\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCannot use scipy.linalg.eigh for sparse A with \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1606\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mk >= N. Use scipy.linalg.eigh(A.toarray()) or\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1607\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m reduce k.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 1608\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(A, LinearOperator):\n\u001b[1;32m 1609\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCannot use scipy.linalg.eigh for LinearOperator \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1610\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mA with k >= N.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"\u001b[0;31mTypeError\u001b[0m: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k."
]
}
],
"source": [
"input_file_path = \"data_out.csv\"\n",
"topic_model, topic_df, most_frequent_topics = ammico.text.PostprocessText(\n",
"    use_csv=True, csv_path=input_file_path\n",
").analyse_topic(return_topics=10)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0b6ef6d7",
"metadata": {},
"source": [
"### Access frequent topics\n",
"A topic of `-1` stands for an outlier and should be ignored. The topic count is the number of occurrences of that topic. The output is sorted from the most frequent to the least frequent topic."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "43288cda-61bb-4ff1-a209-dcfcc4916b1f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-28T07:03:31.426699Z",
"iopub.status.busy": "2023-06-28T07:03:31.426105Z",
"iopub.status.idle": "2023-06-28T07:03:31.455953Z",
"shell.execute_reply": "2023-06-28T07:03:31.455384Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'topic_df' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[12], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[43mtopic_df\u001b[49m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'topic_df' is not defined"
]
}
],
"source": [
"print(topic_df)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b3316770",
"metadata": {},
"source": [
"### Get information for specific topic\n",
"The most frequent topics can be accessed through `most_frequent_topics`, which lists the most frequently occurring topics first."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "db14fe03",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-28T07:03:31.458931Z",
"iopub.status.busy": "2023-06-28T07:03:31.458466Z",
"iopub.status.idle": "2023-06-28T07:03:31.484670Z",
"shell.execute_reply": "2023-06-28T07:03:31.484074Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'most_frequent_topics' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[13], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m topic \u001b[38;5;129;01min\u001b[39;00m \u001b[43mmost_frequent_topics\u001b[49m:\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTopic:\u001b[39m\u001b[38;5;124m\"\u001b[39m, topic)\n",
"\u001b[0;31mNameError\u001b[0m: name 'most_frequent_topics' is not defined"
]
}
],
"source": [
"for topic in most_frequent_topics:\n",
"    print(\"Topic:\", topic)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d10f701e",
"metadata": {},
"source": [
"### Topic visualization\n",
"The topics can also be visualized. Note that this only works if the data is sufficient in both quantity and quality."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "2331afe6",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-28T07:03:31.487585Z",
"iopub.status.busy": "2023-06-28T07:03:31.487142Z",
"iopub.status.idle": "2023-06-28T07:03:31.513497Z",
"shell.execute_reply": "2023-06-28T07:03:31.512930Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'topic_model' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[14], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtopic_model\u001b[49m\u001b[38;5;241m.\u001b[39mvisualize_topics()\n",
"\u001b[0;31mNameError\u001b[0m: name 'topic_model' is not defined"
]
}
],
"source": [
"topic_model.visualize_topics()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f4eaf353",
"metadata": {},
"source": [
"### Save the model\n",
"The model can be saved for future use."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e5e8377c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-28T07:03:31.516740Z",
"iopub.status.busy": "2023-06-28T07:03:31.516260Z",
"iopub.status.idle": "2023-06-28T07:03:31.542273Z",
"shell.execute_reply": "2023-06-28T07:03:31.541701Z"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'topic_model' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[15], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtopic_model\u001b[49m\u001b[38;5;241m.\u001b[39msave(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmisinfo_posts\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'topic_model' is not defined"
]
}
],
"source": [
"topic_model.save(\"misinfo_posts\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c94edb9",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
},
"vscode": {
"interpreter": {
"hash": "da98320027a74839c7141b42ef24e2d47d628ba1f51115c13da5d8b45a372ec2"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}