双语推荐 (Bilingual Recommendations): 分词 (Word Segmentation)

標準化しつつある日本語教育文法には、(1)体系性の欠如、(2)語彙的な部分と文法的な部分の分離、(3)形式重視、といった負の側面がある。動詞の活用を例に、体系性を指摘し、名詞の格を例に、語彙的な部分と文法的な部分の統一と形態論の必要を説き、「~の」の形式をもつ形容詞を例に、意味・機能を重視すべきことを提言した。
The pedagogical grammar of Japanese now becoming standardized has the following negative aspects: (1) a lack of systematicity, (2) a separation between the lexical and the grammatical parts, and (3) an overemphasis on form. Taking verb conjugation as an example, this paper points out the issue of systematicity; taking noun case as an example, it argues for integrating the lexical and grammatical parts and for the necessity of morphology; and taking adjectives with the "~no" form as an example, it proposes that meaning and function be given priority.
由于中文语言的复杂性,给中文分词系统带来了较大的困难,不论哪种分词系统都不能百分百的解决分词问题。针对目前中文分词存在的困难与问题,主要探讨了几种常见的中文分词算法及各自的优缺点。
The complexity of the Chinese language poses considerable difficulties for Chinese word segmentation systems, and no segmentation system can solve the segmentation problem perfectly. In view of the current difficulties and problems in Chinese word segmentation, this paper mainly discusses several common Chinese word segmentation algorithms and the advantages and disadvantages of each.
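
To make the dictionary-based family concrete, the following is a minimal sketch of forward maximum matching, one of the most common mechanical segmentation algorithms; the toy dictionary and test sentence are invented for illustration and are not taken from the paper.

```python
# Minimal sketch of forward maximum matching (FMM), a classic
# dictionary-based segmentation algorithm. The dictionary and the
# sentence below are toy examples.

def fmm_segment(text, dictionary, max_word_len=4):
    """Greedily match the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest candidate first, shrinking one character at a time.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                matched = text[i:i + length]
                break
        # Fall back to a single character (a known weakness of FMM
        # when the word is absent from the dictionary).
        words.append(matched if matched else text[i])
        i += len(words[-1])
    return words

toy_dict = {"中文", "分词", "系统", "中文分词"}
print(fmm_segment("中文分词系统", toy_dict))  # ['中文分词', '系统']
```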

中文分词技术是中文信息处理的基础,快速、准确的中文分词方法是进行中文信息搜索的关键。基于N-最短路径的分词算法,需要计算有向图中从起点到终点的所有路径值,分词效率低,将动态删除算法与最短路径算法结合,通过从最短路径中删除部分节点的策略减少搜索路径范围,从而提高分词效率。
Chinese word segmentation is the basis of Chinese information processing, and a fast, accurate segmentation method is the key to Chinese information search. A segmentation algorithm based on N-shortest paths must compute the values of all paths from the start node to the end node of the directed graph, so its segmentation efficiency is low. By combining a dynamic deletion algorithm with the shortest-path algorithm, the search space is narrowed by deleting some nodes from the shortest path, thereby improving segmentation efficiency.
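
A rough sketch of the word-lattice shortest-path search this abstract builds on: character boundaries are nodes, dictionary words are edges, and the best segmentation is the path with the fewest words. The paper's N-best bookkeeping and dynamic-deletion speedup are not reproduced here, and the dictionary is a toy example.

```python
# Sketch of shortest-path segmentation over a word lattice: find the
# path from character boundary 0 to boundary n that uses the fewest
# dictionary words. Single characters are always allowed as fallback
# edges so that a path always exists.
import heapq

def shortest_path_segment(text, dictionary, max_word_len=4):
    n = len(text)
    edges = {i: [] for i in range(n)}          # i -> reachable end positions
    for i in range(n):
        for j in range(i + 1, min(i + max_word_len, n) + 1):
            if j == i + 1 or text[i:j] in dictionary:
                edges[i].append(j)
    heap = [(0, 0, [])]                        # (word count, position, words)
    settled = {}
    while heap:
        cost, pos, words = heapq.heappop(heap)
        if pos == n:
            return words                       # first completed path is optimal
        if pos in settled and settled[pos] <= cost:
            continue
        settled[pos] = cost
        for j in edges[pos]:
            heapq.heappush(heap, (cost + 1, j, words + [text[pos:j]]))
    return []

toy_dict = {"中文", "分词", "方法", "中文分词"}
print(shortest_path_segment("中文分词方法", toy_dict))  # ['中文分词', '方法']
```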

分词和词性标注是中文语言处理的重要技术,广泛应用于语义理解、机器翻译、信息检索等领域。在搜集整理当前分词和词性标注研究与应用成果的基础上,对中文分词和词性标注的基本方法进行了分类和探讨。首先在分词方面,对基于词典的和基于统计的方法进行了详细介绍,并且列了三届分词竞赛的结果;其次在词性标注方面,分别对基于规则的方法和基于统计的方法进行了阐述;接下来介绍了中文分词和词性标注一体化模型相关方法。此外还分析了各种分词和词性标注方法的优点和不足,在此基础上,为中文分词和词性标注的进一步发展提供了建议。
Word segmentation and part-of-speech (POS) tagging are basic tasks of Chinese language processing (CLP) and are widely applied in semantic understanding, machine translation, information retrieval, and other fields. Based on a survey of current research and application results, this paper classifies and discusses the basic methods of Chinese word segmentation (CWS) and POS tagging. First, for word segmentation, dictionary-based and statistics-based methods are introduced in detail, and the results of three segmentation competitions are listed. Second, for POS tagging, rule-based and statistics-based methods are expounded. Next, the main methods for building joint CWS and POS tagging models are presented. The paper also analyzes the advantages and disadvantages of the various CWS and POS tagging methods and, on that basis, offers suggestions for their further development.
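
As a toy illustration of the statistics-based tagging methods the survey covers, the following Viterbi decoder finds the best tag sequence under a first-order hidden Markov model; every probability below is invented for demonstration, not estimated from a corpus.

```python
# Toy Viterbi decoder for a first-order HMM tagger. All probabilities
# are invented for demonstration, not estimated from a corpus.
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` (log space)."""
    # Layer 0: start probability times emission of the first word.
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-8)), [t])
          for t in tags}]
    for w in words[1:]:
        layer = {}
        for t in tags:
            # Best previous tag for reaching tag t at this word.
            score, path = max(
                (V[-1][p][0] + math.log(trans_p[p][t])
                 + math.log(emit_p[t].get(w, 1e-8)), V[-1][p][1])
                for p in tags)
            layer[t] = (score, path + [t])
        V.append(layer)
    return max(V[-1].values())[1]

tags = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"学生": 0.5, "汉语": 0.5}, "V": {"学习": 0.9, "汉语": 0.1}}
print(viterbi(["学生", "学习", "汉语"], tags, start_p, trans_p, emit_p))
# -> ['N', 'V', 'N']
```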

藏文自动分词问题是藏文自然语言处理的基本问题之一。针对藏文自动分词中的重点难点,例如:格助词的识别、歧义切分、未登录词识别技术设计一个新的藏文自动分词系统。该系统采用动态词频更新和基于上下文词频的歧义处理和未登录词识别技术。在歧义字段分词准确性、未登录词识别率和分词速度上,该系统具有较优的性能。
Automatic Tibetan word segmentation is one of the basic problems in Tibetan natural language processing. Targeting its key difficulties, such as case-marker identification, ambiguity resolution, and unknown-word recognition, we design a new automatic Tibetan word segmentation system. The system uses dynamic word-frequency updating together with ambiguity handling and unknown-word recognition based on contextual word frequency. It performs well in terms of segmentation accuracy on ambiguous strings, the unknown-word recognition rate, and segmentation speed.
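
The dynamic word-frequency strategy described above can be sketched as follows. This is only a schematic of the idea, shown with Chinese toy words since the strategy itself is language-independent; it is not the paper's actual Tibetan system.

```python
# Schematic of disambiguation by dynamically updated word frequencies:
# the segmenter keeps running counts of words already seen in the
# document and prefers the candidate segmentation whose words carry
# more frequency mass, then updates the counts with the winner.
from collections import Counter

class DynamicFreqDisambiguator:
    def __init__(self):
        self.freq = Counter()

    def choose(self, candidates):
        """Pick the candidate with the highest running-frequency mass."""
        best = max(candidates,
                   key=lambda words: sum(self.freq[w] for w in words))
        self.freq.update(best)        # dynamic update with the chosen words
        return best

d = DynamicFreqDisambiguator()
d.freq.update(["结合", "分子"])       # words seen earlier in the document
print(d.choose([["结合成", "分子"], ["结合", "成", "分子"]]))
# -> ['结合', '成', '分子'], whose words are locally more frequent
```
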
词语粗分是分词后续处理的基础和前提,直接影响到分词系统最终的准确率和召回率。针对目前常用分词方法单一使用时存在的不足,综合机械分词的高效性和统计分词的灵活性,设计一种基于最短路径的二元语法中文词语粗分模型。实验结果表明,此粗分模型无论在封闭测试和开放测试中,还是在不同粗分模型对比测试和不同领域的开放测试中,都有较好的句子召回率。
Rough word segmentation is the foundation and premise of subsequent segmentation processing, and it directly affects the final accuracy and recall of a segmentation system. To overcome the limitations of current segmentation methods when used alone, and to combine the efficiency of mechanical segmentation with the flexibility of statistical segmentation, this paper designs a Chinese rough segmentation model based on the shortest path and a bigram model. Experimental results show that the model achieves good sentence recall in closed and open tests, in comparisons against other rough segmentation models, and in open tests across different domains.
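
The statistical half of such a model is the bigram scoring that weights the word graph: a candidate path is ranked by its smoothed bigram probability, so the "shortest" path is the one with the lowest negative log-probability. A minimal sketch with invented counts:

```python
# Bigram scoring for a candidate segmentation: the path cost is the
# negative log-probability under add-one smoothing, so lower is better.
# All counts below are invented toy values.
import math

def bigram_cost(words, bigram_counts, unigram_counts, vocab_size):
    """Negative log-probability of a segmentation, add-one smoothed."""
    cost, prev = 0.0, "<s>"
    for w in words:
        p = (bigram_counts.get((prev, w), 0) + 1) / \
            (unigram_counts.get(prev, 0) + vocab_size)
        cost += -math.log(p)
        prev = w
    return cost

unigrams = {"<s>": 100, "中文": 30, "分词": 25, "中": 5}
bigrams = {("<s>", "中文"): 20, ("中文", "分词"): 15}
for cand in (["中文", "分词"], ["中", "文分词"]):
    print(cand, round(bigram_cost(cand, bigrams, unigrams, 10000), 2))
# ['中文', '分词'] costs about 12.6 vs about 18.4 for the alternative,
# so it wins the shortest-path search.
```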

中文自动分词是实现搜索引擎信息检索的基础,分词词典是汉语自动分词系统的一个重要组成部分,词典的加载和查询速度直接影响到分词系统的速度。文中在研究传统词典机制的基础上,分析了基于双字哈希的词典机制在组织词条除首两字外剩余字串方面的不足,给出了一种改进的双字哈希的词典机制。最后,文中对改进算法从准确率、分全率和分词速度等方面进行了测试,结果表明,改进后的分词算法在不提升已有典型词典机制维护复杂度的情况下,提高了词条匹配的查询速度和效率。
Automatic Chinese word segmentation is the basis of information retrieval in search engines, and the segmentation dictionary is an important component of a Chinese word segmentation system: the speed of loading and querying the dictionary directly affects the speed of the whole system. Building on a study of traditional dictionary mechanisms, this paper analyzes the weakness of the double-character hash dictionary mechanism in organizing the characters of an entry beyond the leading ones, and proposes an improved double-character hash dictionary mechanism. Finally, the improved algorithm is tested for accuracy, recall, and segmentation speed; the results show that it improves the query speed and efficiency of entry matching without increasing the maintenance complexity of existing typical dictionary mechanisms.
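
A schematic of a double-character hash dictionary of the kind discussed above: the first character indexes a hash table, the second indexes a nested one, and the remaining characters of each entry are kept in a set for constant-time membership tests. The paper's improved organization of the remainder may differ from this sketch.

```python
# Schematic double-character hash dictionary: hash on the first
# character, then on the second, and keep the remaining characters of
# each entry in a set for O(1) membership tests.
from collections import defaultdict

class DoubleHashDict:
    def __init__(self, words):
        self.single = set()                        # one-character words
        self.index = defaultdict(lambda: defaultdict(set))
        for w in words:
            if len(w) == 1:
                self.single.add(w)
            else:
                # "" marks a word of exactly two characters.
                self.index[w[0]][w[1]].add(w[2:])

    def __contains__(self, w):
        if len(w) == 1:
            return w in self.single
        return w[2:] in self.index[w[0]][w[1]]

d = DoubleHashDict(["中", "中文", "中文分词", "分词"])
print("中文分词" in d, "中文分" in d)  # -> True False
```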

该文设计了一个基于复杂形式最大匹配算法(MMSeg_Complex)的自定义中文分词器,该分词器采用四种歧义消除规则,并实现了用户自定义词库、自定义同义词和停用词的功能,可方便地集成到Lucene中,从而有效地提高了Lucene的中文处理能力。通过实验测试表明,该分词器的分词性能跟Lucene自带的中文分词器相比有了极大的提高,并最终构建出了一个高效的中文全文检索系统。
This paper designs a custom Chinese analyzer based on the complex form of the maximum matching algorithm (MMSeg_Complex). The analyzer applies four disambiguation rules and supports user-defined dictionaries, custom synonyms, and stop words, and it can be easily integrated into Lucene, effectively improving Lucene's Chinese-processing capability. Experiments show that its segmentation performance is greatly improved over the Chinese analyzer built into Lucene, and on this basis an efficient Chinese full-text retrieval system is built.
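
The four disambiguation rules are those of the MMSeg algorithm: prefer the chunk with the greatest total length, then the greatest average word length, then the smallest variance of word lengths, then the largest sum of log frequencies of its single-character words. A sketch of the rule cascade over pre-generated candidate chunks (chunk generation and the real frequency table are omitted; the chunks below are toy input):

```python
# Sketch of MMSeg's four "complex" disambiguation rules, applied as
# successive filters until a single candidate chunk survives. A chunk
# is up to three consecutive words starting at the current position.
import math

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def pick_chunk(chunks, char_freq):
    rules = [
        lambda c: sum(len(w) for w in c),             # 1. max total length
        lambda c: sum(len(w) for w in c) / len(c),    # 2. max average word length
        lambda c: -variance([len(w) for w in c]),     # 3. min variance of lengths
        lambda c: sum(math.log(char_freq.get(w, 1))   # 4. max freq of single chars
                      for w in c if len(w) == 1),
    ]
    for rule in rules:
        best = max(rule(c) for c in chunks)
        chunks = [c for c in chunks if rule(c) == best]
        if len(chunks) == 1:
            break
    return chunks[0]

chunks = [["研究生", "命", "起源"], ["研究", "生命", "起源"]]
print(pick_chunk(chunks, {"命": 100}))  # ['研究', '生命', '起源'] (rule 3 decides)
```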

目的:研究适用于中医医案文献自动分词的方案。方法:使用层叠隐马模型作为分词模型,建立相关中医领域词典及测试语料库,对语料库中古代医案文献和现代医案文献各300篇进行分词及评测。结果:在未使用中医领域词典时,两类医案文献分词准确率均为75%左右;使用中医领域词典后,古代医案文献的分词准确率达到90.73%,现代医案文献的分词准确率达到95.66%。在未使用中医领域词典时,词性标注准确率古代医案文献为56.74%,现代医案文献为64.81%;使用中医领域词典后,现代医案文献为91.45%,明显高于古代医案文献的78.47%。结论:现有分词方案初步解决了中医医案文献的分词问题,对现代医案文献的词性标注也基本正确,但古代医案文献的词性标注影响因素较多,还需进一步研究。
Objective: To study an automatic word segmentation scheme suitable for traditional Chinese medicine (TCM) medical record literature. Methods: A hierarchical hidden Markov model was used as the segmentation model; a TCM domain dictionary and a test corpus were built, and 300 ancient and 300 modern medical record documents from the corpus were segmented and evaluated. Results: Without the TCM domain dictionary, the segmentation accuracy for both kinds of documents was about 75%; with the dictionary, it reached 90.73% for ancient documents and 95.66% for modern ones. Without the dictionary, POS tagging accuracy was 56.74% for ancient documents and 64.81% for modern ones; with the dictionary, it reached 91.45% for modern documents, clearly higher than the 78.47% for ancient ones. Conclusion: The current scheme preliminarily solves the word segmentation problem for TCM medical record literature, and its POS tagging of modern documents is largely correct, but POS tagging of ancient documents is affected by many factors and requires further study.
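
The result hinges on plugging a domain lexicon into a general-purpose segmenter. The paper's system is a cascaded hidden Markov model rather than jieba, but the open-source jieba segmenter illustrates the same mechanism, with user-dictionary entries overriding the default statistics; the TCM terms below are merely examples.

```python
# Loading a domain lexicon into a general-purpose segmenter. The cited
# system is a cascaded HMM, not jieba; jieba merely illustrates the
# mechanism of user-dictionary entries overriding default statistics.
# The TCM terms below are examples, not the paper's actual dictionary.
import jieba
import jieba.posseg as pseg

for term in ["六味地黄丸", "肝肾阴虚"]:    # example TCM lexicon entries
    jieba.add_word(term, tag="n")          # register with a noun tag

print([(p.word, p.flag) for p in pseg.cut("患者肝肾阴虚,予六味地黄丸")])
```
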
汉语分词是中文信息处理的一项基础性工作。为避免人工阅读或机器处理时的分词歧义和未登录词难以识别的问题,有专家建议写作时在汉语词之间添加空格。文章从语言学本体研究、语言使用以及语言工程等不同角度对传统观念下的汉语分词存在的困难进行探讨,指出汉语分词在词的定义、群众语感以及分词规范、词表确定及工程应用等方面都存在不确定及不一致等因素。近年汉语自动分词处理不纠缠于词的确切定义,以字组词,针对标注语料和网络上带有丰富结构信息的海量文本,利用机器学习方法对汉语“切分单位”的标注取得了较好的进展。针对基础性的汉语分词规范,从语言规划的政策性、科学性及引导性角度提出建议,最后指出结合语言学指导和数据驱动的机器学习策略,可望为实现汉语自动分词的准确性和适应性提升服务。
Chinese word segmentation is fundamental to Chinese information processing. To avoid segmentation ambiguity and the difficulty of recognizing out-of-vocabulary words in human reading or machine processing, some experts have proposed inserting spaces between Chinese words at writing time. This paper examines the difficulties of Chinese word segmentation under the traditional view from the perspectives of linguistic theory, language use, and language engineering, pointing out uncertainty and inconsistency in the definition of the word, speakers' intuitions, segmentation standards, word-list construction, and engineering application. In recent years, automatic segmentation has sidestepped the exact definition of the word by building words from characters: using machine learning over annotated corpora and massive web text rich in structural information, character-based tagging of Chinese "segmentation units" has made good progress. For the underlying segmentation standards, suggestions are offered from the standpoint of language planning, covering its policy, scientific, and guiding dimensions. Finally, the paper points out that machine learning strategies combining linguistic guidance with data-driven methods can be expected to improve the accuracy and adaptability of automatic Chinese word segmentation.
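
The character-based tagging approach mentioned above recasts segmentation as labeling each character B/M/E/S: the beginning, middle, or end of a multi-character word, or a single-character word. A minimal sketch of the encoding and decoding that wrap whatever classifier is trained:

```python
# BMES encoding: each character is labeled as the Begin, Middle, or End
# of a multi-character word, or as a Single-character word. A trained
# classifier predicts these labels; encoding and decoding are fixed.

def words_to_bmes(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def bmes_to_words(chars, tags):
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):        # word boundary after this character
            words.append(buf)
            buf = ""
    if buf:                           # tolerate a truncated tag sequence
        words.append(buf)
    return words

tags = words_to_bmes(["汉语", "自动", "分词"])
print(tags)                                 # ['B', 'E', 'B', 'E', 'B', 'E']
print(bmes_to_words("汉语自动分词", tags))  # ['汉语', '自动', '分词']
```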
