CJKSplitter - Chinese, Japanese, Korean word splitter for ZCTextIndex

CJKSplitter is a ZCTextIndex splitter for CJK (Chinese-Japenese-Korea) text stored as Unicode. It uses a simple, but workable, "hack" instead of trying to do real word splitting from dictionaries. Compared to a dictionary based word splitter, this results in a bigger index and more matches than necessary, but it is a cheap price to pay for the reduced complexity.

Feature

  • use regular expression to compatible with defualt English white space splitter
  • much simpler code, easy to install, easy to use
  • support multiple encodings: unicode/utf-8/gb18030/gbk/gb2312/mbcs/big5. provide 3 splitters(more to come):
    • 'CJK splitter' : support unicode/utf-8 encoding.
    • 'CJK GB splitter' : support unicode/gb18030/gbk/gb2312/mbcs encodings.
    • 'CJK BIG5 splitter' : support unicode/big5/mbcs encodings
  • support english globing
  • support single Chinese charactor search