Chinese word segmentation is a fundamental task in Chinese information processing. To avoid segmentation ambiguity and the difficulty of recognizing out-of-vocabulary words in human reading or machine processing, some experts have suggested inserting spaces between Chinese words when writing. This paper examines the difficulties of Chinese word segmentation under the traditional view from the perspectives of linguistic theory, language use, and language engineering, and points out uncertainties and inconsistencies in the definition of the word, native speakers' intuitions, segmentation standards, word-list construction, and engineering application. In recent years, Chinese automatic word segmentation has not dwelt on an exact definition of the word; instead, by building words from characters and drawing on annotated corpora and massive web text rich in structural information, machine-learning methods for labeling Chinese "segmentation units" have made good progress. For the fundamental Chinese word segmentation standard, suggestions are offered from the perspectives of policy, scientific soundness, and guidance in language planning. Finally, the paper argues that combining linguistic guidance with data-driven machine-learning strategies can be expected to improve the accuracy and adaptability of Chinese automatic word segmentation.
Chinese word segmentation is fundamental to Chinese information processing. To avoid segmentation ambiguity and out-of-vocabulary words, it has been proposed that writers insert spaces between Chinese words, a proposal we disagree with. This paper first elaborates the difficulties of word segmentation from the perspectives of linguistic study, language performance, and language engineering, and then discusses uncertain factors in the definition of the word, language awareness, word segmentation specifications, word-list construction, and their application in automatic text information processing. Not dwelling on an exact definition of the word, the paper surveys recent advances in character-based tagging over massive manually annotated resources, which show inspiring progress. At the end of the paper, we put forward word segmentation guidelines from the stance of language policy. Guided by linguistic theory and data-driven machine learning algorithms, a practical word segmentation system can achieve improved accuracy and adaptability.
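The character-based tagging view mentioned above can be sketched minimally as follows. This is a hypothetical illustration, not the paper's system: each character receives one of the labels B (word-begin), M (word-middle), E (word-end), or S (single-character word), and real systems learn these labels with sequence models (e.g. CRFs or neural taggers); here we only show the label encoding and decoding.

```python
# Minimal sketch of character-based ("building words from characters") tagging
# for Chinese word segmentation. Hypothetical helper names; a real segmenter
# would predict the BMES tags with a learned sequence model.

def words_to_tags(words):
    """Encode a gold segmentation as per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("M" * (len(w) - 2))
            tags.append("E")
    return tags

def tags_to_words(chars, tags):
    """Decode a BMES tag sequence back into a word sequence."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):   # a word ends at E or S
            words.append(buf)
            buf = ""
    if buf:                     # tolerate a dangling B/M at the end
        words.append(buf)
    return words

sentence = ["汉语", "分词", "是", "基础", "工作"]
tags = words_to_tags(sentence)
print(tags)  # ['B', 'E', 'B', 'E', 'S', 'B', 'E', 'B', 'E']
print(tags_to_words("".join(sentence), tags))  # recovers the original words
```

One appeal of this formulation, as the abstract notes, is that it sidesteps an exact definition of the word: the model only needs consistent per-character labels in the training data, whatever segmentation convention produced them.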