The Linguistic Association of Korea (대한언어학회)

Journal

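Title	A Study on a Preprocessor and Dictionary Expansion for Processing Unanalyzed Words in Social Media Texts
Authors	Seong-Yong Choi, Dong-Hyok Shin, Jeesun Nam
Volume/Issue	Vol. 25, No. 4
Pages	193-226
Publication date	December 31, 2017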
Abstract	Choi, Seong-Yong, Shin, Dong-Hyok & Nam, Jeesun. (2017). A methodology for building linguistic resources that recognize unanalyzed sequences in social media texts. The Linguistic Association of Korea Journal, 25(4), 193-226. This study analyzes the linguistic problems posed by unanalyzed tokens in Social Media (SM) texts and proposes methodologies for handling them effectively. As the number of SM users has risen, the need to analyze such texts has increased significantly. However, unanalyzed tokens severely hamper the overall performance of processing SM textual data. This study proposes two methodologies: 1) a normalization process using a preprocessing module named Preprocessing Grammar Table (PGT) to correct frequent unanalyzed sequences such as orthographic errors and spacing errors; 2) a lexicon-based method utilizing the DECO dictionary and Local Grammar Graph (LGG). Applying PGT and an enhanced DECO dictionary to SM texts considerably improves preprocessing performance, removing 87% of the unanalyzed tokens, which demonstrates the significance of the research.