coreseek Archives - 論野生技術&二次元

Sphinx 2.2.11/2.3.2 mmseg patch (coreseek based)

on September 22, 2017

These patches include work from nzinfo (adding mmseg support) and my patch for Hiragana and Katakana support (see this blog post). The changes can be viewed here.

For Sphinx 2.2.11: Github dl.yooooo.us

For Sphinx 2.3.2: Github dl.yooooo.us

mmseg word segmentation patch for Sphinx 2.2.11 and 2.3.2 (based on Coreseek)

on September 22, 2017

C/C++

English Version

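The patch includes the mmseg patch that nzinfo contributed to coreseek, as well as the Japanese kana patch I submitted (see this blog post for details). The detailed changes can be viewed here.

Once the patch is applied, mmseg can be used as the tokenizer. If you are indexing long articles, I recommend using mmseg to process the jieba (結巴分詞) dictionary and produce a reasonably reliable dictionary.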
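As a rough illustration of that dictionary step, the sketch below converts jieba-style dictionary lines into the plain-text form that mmseg compiles. Both layouts are assumptions rather than something this post states: jieba lines are assumed to be `word freq [tag]` separated by spaces, and the mmseg source is assumed to be `word<TAB>freq` followed by an `x:freq` line. The function name `jieba_to_mmseg` is made up for this example.

```python
def jieba_to_mmseg(lines):
    """Convert jieba-style dictionary lines to mmseg unigram source text.

    Assumed input:  "word freq [tag]" (space separated), one entry per line.
    Assumed output: "word<TAB>freq" followed by "x:freq" for every entry.
    """
    out = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        word, freq = parts[0], parts[1]
        out.append(f"{word}\t{freq}")
        out.append(f"x:{freq}")
    # One output line per list item, each terminated by a newline.
    return "".join(item + "\n" for item in out)


if __name__ == "__main__":
    sample = ["中文 3000 n", "分詞 120 v", ""]
    print(jieba_to_mmseg(sample), end="")
```

The resulting text would still have to be compiled into a binary dictionary with the mmseg command-line tool; check its usage message for the exact option on your build.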

Sphinx 2.2.11: Github download, local download

Sphinx 2.3.2: Github download, local download

Making Coreseek index Japanese kana

on March 15, 2016

C/C++

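coreseek is a modified sphinx that uses mmseg for Chinese word segmentation. I ran into one problem, though: Japanese search always performed poorly, and a keyword made up entirely of kana returned an empty result.

At first I guessed that the dictionary simply did not contain Japanese. Thinking it over more carefully, though, mmseg should fall back to unigram segmentation for words that are not in the dictionary, so in principle it should still be able to index Japanese. The mmseg command-line tool confirms this: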

$ /usr/local/mmseg/bin/mmseg -d /usr/local/mmseg/etc/ 1.txt
ヨ/x ス/x ガ/x ノ/x ソ/x ラ/x

This shows that mmseg did perform unigram segmentation.
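To make the unigram fallback concrete, here is a minimal sketch of a dictionary segmenter that emits single characters whenever no dictionary word matches. It is a simplified greedy forward maximum matching, not the actual MMSEG algorithm that mmseg implements; the function name, the `max_len` parameter and the tiny dictionary are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching with a single-character fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in dictionary:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # No dictionary word starts here: emit one character (unigram).
            tokens.append(text[i])
            i += 1
    return tokens


if __name__ == "__main__":
    words = {"中文", "分詞"}  # hypothetical dictionary with no Japanese entries
    print("/".join(segment("ヨスガノソラ", words)))
```

With a dictionary that contains no kana words, every kana character comes out as its own token, which matches the behaviour of the command-line run above.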

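So why can't coreseek find kana? After a lot of digging I finally found that, while segmenting with mmseg, coreseek filters the input characters, and the code carries a comment: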

// BEGIN CJK There is no case folding, should do this in remote tokenizer.
// Here just make CJK Charactor will remain. --coreseek
dRemaps.Add ( CSphRemapRange ( 0x4e00, 0x9FFF, 0x4e00 ) );
dRemaps.Add ( CSphRemapRange ( 0xFF00, 0xFFFF, 0xFF00 ) );
dRemaps.Add ( CSphRemapRange ( 0x3040, 0x303F, 0x3040 ) );

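So although coreseek added all CJK (Chinese, Japanese, Korean) Han characters, fullwidth characters and punctuation to the ranges, it left out hiragana and katakana. Changing the third range to 0x3000, 0x30FF, 0x3000 fixes the problem.

Where: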

// 4e00 - 9fff CJK unified ideographs
// 3000 - 303f CJK symbols and punctuation
// 3040 - 30ff Hiragana/Katakana
// ff00 - ffff half/fullwidth forms
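The effect of widening the third range can be checked with a small sketch. It assumes, as the description above says, that the original third range covered only CJK symbols and punctuation (0x3000 to 0x303F), and that characters outside every range are discarded before tokenizing. The names `ORIGINAL`, `PATCHED`, `is_kept` and `dropped` are made up for illustration and are not part of Sphinx or coreseek.

```python
# (start, end) code point pairs, inclusive.
# ORIGINAL assumes the third range covered only CJK symbols and punctuation.
ORIGINAL = [(0x4E00, 0x9FFF), (0xFF00, 0xFFFF), (0x3000, 0x303F)]
# PATCHED widens the third range to include Hiragana and Katakana.
PATCHED = [(0x4E00, 0x9FFF), (0xFF00, 0xFFFF), (0x3000, 0x30FF)]


def is_kept(ch, ranges):
    """Return True if the character's code point falls inside any range."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in ranges)


def dropped(text, ranges):
    """List the characters of text that no range covers."""
    return [ch for ch in text if not is_kept(ch, ranges)]


if __name__ == "__main__":
    print(dropped("ヨスガノソラ", ORIGINAL))
    print(dropped("ヨスガノソラ", PATCHED))
```

Under these assumptions every character of an all-kana keyword is discarded by the original ranges, which is consistent with the empty search results, while the patched ranges keep all of them.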

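I have put the modified version on github.

Also, here you can look up which characters a Unicode code range covers; unicode.org has a database, but it is one big PDF listing every character, and I could not find a comparable breakdown by category there.

For Ubuntu/Debian, prebuilt coreseek deb packages are available here: i386 amd64; they depend on mmseg: i386 amd64; and the dictionary that ships with mmseg.

For versions >2.2.10, I provide a complete patch in this blog post that can be applied to the sphinx source before compiling.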

Compiling Coreseek 4.x/5.x and some notes

on January 2, 2015

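If you are on 32-bit ubuntu/debian, you can download my prebuilt coreseek5/csft package here.

For versions >2.2.10, I provide a complete patch in this blog post that can be applied to the sphinx source before compiling.

Compiling mmseg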

git clone https://github.com/nzinfo/mmseg
cd mmseg
automake --add-missing
./bootstrap
./configure --prefix=/usr/local/mmseg
make && make install
Compiling coreseek5

apt-get install Cython
git clone https://github.com/nzinfo/csft
cd csft
git checkout r/csft5
sh buildconf.sh
automake --add-missing
./configure --prefix=/usr/local/coreseek --with-mysql --with-mmseg-includes=/usr/local/mmseg/include/mmseg --with-mmseg-libs=/usr/local/mmseg/lib
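The automated scripts have a few bugs:

Run autoconf and automake manually once
After ./configure xxxx, edit src/Makefile and add -L/usr/local/mmseg3/lib -lmmseg to LIBS
~~Add tokenizer_zhcn.$(OBJEXT) to am__object_1~~
~~Add tokenizer_zhcn.cpp to SRC_SPHINX~~
~~From some gcc 5.x release onward gcc goes brain-dead: one place in sphinx.cpp needs this-> added. I forget which variable, but no matter, gcc will tell you where~~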

If you use 4.1, you cannot use the latest sphinxclient; look in the sphinx_php_api trunk for the version that has define ( "VER_COMMAND_SEARCH", 0x117 ); (0x117 -> 1.23)
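The `0x117 -> 1.23` note reads the constant as a major/minor pair. A minimal sketch of that arithmetic, assuming the high byte is the major version and the low byte the minor version (the function names are hypothetical and not part of the Sphinx client API):

```python
def decode_command_version(value):
    """Split a 16-bit command version into (major, minor): high byte, low byte."""
    return value >> 8, value & 0xFF


def format_command_version(value):
    """Render the version as 'major.minor' with both parts in decimal."""
    major, minor = decode_command_version(value)
    return f"{major}.{minor}"


if __name__ == "__main__":
    print(format_command_version(0x117))
```

Here 0x117 splits into 0x01 and 0x17, and 0x17 is 23 in decimal, which gives 1.23.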

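Finally, here is a picture to express how I feel about xunsearch…………

It is slow (searches are 5~10 times slower than sphinx, and the index is 7 times the size of the raw data and 21 times the size of the sphinx index), and it returns fewer results (mysql fuzzy search ≈ coreseek search ≈ 740+, while this thing returns only 114; the exact value from $search->count is 286)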

Also, if the ini file name does not match the project_name set inside it, it misbehaves in bizarre ways

Go on, tell me I am just using it wrong