notes/pl/cpp/libfws/nlp/scws.txt
Ihar Hancharenka 5dff80e88e first
2023-03-27 16:52:17 +03:00

51 строка
1.6 KiB
Plaintext

xunsearch
https://github.com/hightman/xunsearch
https://github.com/hightman/scws
scws
http://www.xunsearch.com/scws/
http://www.xunsearch.com/scws/index.php
Latest:
1.2.3
http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2
Archive:
http://www.xunsearch.com/scws/down/scws-1.1.8.tar.bz2
https://github.com/amutu/scws-port/blob/master/pkg-descr
SCWS (Simple Chinese Word Segmentation) is a frequency dictionary based Chinese word segmentation engine,
it can cut a whole section of the Chinese text into words.
Word is the smallest unit of morpheme in Chinese,
but in Chinese words are not separated by spaces,
so word segmentation is an important step for Chinese language process.
SCWS is written in C without other dependencies and accept GBK and UTF-8 encoding
for both the Simple Chinese (zh_CN) and the Traditional Chinese (such as zh_TW).
http://www.xunsearch.com/scws/download.php
http://www.xunsearch.com/scws/support.php
http://www.xunsearch.com/scws/docs.php
Ruleset file:
SCWS and PSCWS4 general rule set file that is used to identify people, places, and so on in the digital age.
Contains simplified GBK, Traditional Chinese UTF8, UTF8 Simplified three files.
You do not need to be downloaded separately, together with scws released source package already contains these files.
https://github.com/hightman/scws/blob/master/etc/rules_cht.utf8.ini
API:
http://www.xunsearch.com/scws/api.php
https://github.com/hightman/scws/blob/master/API.md
scws_new, scws_set_charset, scws_add_dict, scws_[add/set]_dict, scws_set_rule
Dictionaries:
gen-dict:
Usage:
scws-gen-dict [options] [-i] dict.txt [-o] dict.xdb