Diffstat (limited to 'textproc/split-thai/files/README.txt')
-rw-r--r-- | textproc/split-thai/files/README.txt | 93 |
1 files changed, 57 insertions, 36 deletions
diff --git a/textproc/split-thai/files/README.txt b/textproc/split-thai/files/README.txt
index 7b91f97fb9a..7480d4b4c2a 100644
--- a/textproc/split-thai/files/README.txt
+++ b/textproc/split-thai/files/README.txt
@@ -1,49 +1,70 @@
-This is a collection of utilities to separate Thai words by spaces
-(word tokenization). They can separate stdin, files, or text as
-arguments. It includes 3 separate utilities:
+NAME
+        st-emacs
+        st-icu
+        st-swath
 
-st-emacs: emacs-script using emacs lisp thai-word library
-        https://www.gnu.org/software/emacs/
-st-icu: basic C++ program using the ICU library
-        http://site.icu-project.org/
-st-swath: sh script wrapper to simplfy args to the swath program
-        https://linux.thai.net/projects/swath
+SYNOPSIS
+        st-emacs|st-icu|st-swath [filename|text1 text2 ...|'blank']
 
-All scripts should be able to take a filename, stdin, or arguments as
-input, e.g., :
+DESCRIPTION
+        This package is a collection of utilities to separate Thai words
+        by spaces (word tokenization). They can separate stdin, files,
+        or text as arguments. It includes 3 separate utilities:
 
+        st-emacs: emacs-script using emacs lisp thai-word library
+                https://www.gnu.org/software/emacs/
+        st-icu: basic C++ program using the ICU library
+                http://site.icu-project.org/
+        st-swath: sh script wrapper to simplfy args to the swath program
+                https://linux.thai.net/projects/swath
+
+EXAMPLES
+        split one or more text strings
         # st-swath แมวและหมา
-or
-        # echo "แมวและหมา" | st-swath
-or
-        # st-swath < thaifile.txt
-or
         # st-swath "แมวหมา" พ่อและแม่
 
-You will most likely need to set LC_ALL or LC_CTYPE to an approriate
-unicode value, e.g., en_US.UTF-8 or C.UTF-8, in the environment for
-them to work properly. These tools are setup to only support UTF-8
-encodings.
+        read stdin
+        # echo "แมวและหมา" | st-swath
+
+        read from a file
+        # st-swath < thaifile.txt
+        # st-swath somefile.txt
+
+        They can also read directly from stdin
+        # st-icu
+        แมวหมา (typed in)
+        แมว หมา (output line by line)
+
+ENVIRONMENT
+        You will most likely need to set the environment variables LC_ALL
+        or LC_CTYPE for proper unicode handling, e.g., en_US.UTF-8 or
+        C.UTF-8. These tools are only setup to handle UTF-8 encodings.
 
-Note that it is not possible to split Thai words 100% accurately
-without context and meaning. These programs use dictionary-based word
-splitting.
+EXIT STATUS
+        0 for success, non zero otherwise
 
-Also included in the package is a combined thai word dictionary and
-corresponding .tri file, and emacs lisp .el file for reading and
-dumping out dictionary files.
+NOTES
+        Note that it is not possible to split Thai words 100% accurately
+        without context and meaning. All these programs use
+        dictionary-based word splitting.
 
-st-emacs and st-swath are setup to use the combined dictionary with
-words from the emacs 'thai-word library, swath dictionary words, and
-the icu thai library words.
+        Also included in the package is a combined thai word dictionary
+        and corresponding .tri file, and emacs lisp .el files for reading
+        and dumping out dictionary files.
 
-st-icu uses its own built in library. To customise the icu
-dictionary, you apparently would have to modify
-        icu4c/source/data/brkitr/dictionaries/thaidict.txt
-and rebuild icu library, and then rebuild the whole thing.
+        st-emacs and st-swath are setup to use the combined dictionary
+        with words from the emacs 'thai-word library, swath dictionary
+        words, and the icu thai library words.
 
-There is also
+        st-icu uses its own built in library. To customise the icu
+        dictionary, you apparently would have to modify
+        icu4c/source/data/brkitr/dictionaries/thaidict.txt and then
+        rebuild the whole library.
 
-See also swath(1), libthai(1), emacs(1), locale(1), uconv(1), iconv(1)
+SEE ALSO
+        swath(1), libthai(1), emacs(1), locale(1), uconv(1), iconv(1)
 
-TODO - fix st-icu to use all the combined dictionary words.
+BUGS
+        st-icu should also use the combined dictionary words.
+        st-emacs and st-icu don't always split thai numbers well.
+        this file should be converted to a proper manpage.
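
The ENVIRONMENT and EXIT STATUS sections of the new README suggest a usage
pattern along the following lines. This is only a minimal sketch, assuming a
POSIX sh, that a C.UTF-8 locale is available on the system (en_US.UTF-8 is the
other value the README mentions), and that st-swath is installed on PATH. The
helper name ensure_utf8_locale is hypothetical and is not part of the package.

```shell
#!/bin/sh
# Hypothetical helper: switch to a UTF-8 locale only when the effective
# character locale (LC_ALL, then LC_CTYPE, then LANG) is not already UTF-8.
# Assumption: C.UTF-8 exists here; substitute en_US.UTF-8 if it does not.
ensure_utf8_locale() {
    case "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}" in
        *[Uu][Tt][Ff]-8*|*[Uu][Tt][Ff]8*) ;;      # already UTF-8, leave as-is
        *) LC_ALL=C.UTF-8; export LC_ALL ;;
    esac
}

ensure_utf8_locale

# Run the splitter only if it is installed, and report a non-zero exit
# status, which the README documents as the failure case.
if command -v st-swath >/dev/null 2>&1; then
    st-swath "แมวและหมา" || echo "st-swath failed with status $?" >&2
fi
```

The same wrapper should apply to st-emacs and st-icu, since the README
describes all three as taking the same kinds of input.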