Abstract |
Choi, Seong-Yong, Shin, Dong-Hyok & Nam, Jeesun. (2017). A methodology for building linguistic resources that recognize unanalyzed sequences in social media texts. The Linguistic Association of Korea Journal, 25(4), 193-226. This study analyzes the linguistic problems posed by unanalyzed tokens in Social Media (SM) texts and proposes methodologies for handling them effectively. As the number of SM users has grown, so has the need to analyze such texts. However, unanalyzed tokens severely hamper the overall performance of SM text processing. This study proposes two methodologies: 1) a normalization process that uses a preprocessing module named the Preprocessing Grammar Table (PGT) to correct frequent unanalyzed sequences such as orthographic errors and spacing errors; 2) a lexicon-based method that uses the DECO dictionary and Local Grammar Graphs (LGG). When the PGT and an enhanced DECO dictionary are applied to SM texts, preprocessing performance improves considerably, with 87% of the unanalyzed tokens removed, which underscores the significance of this research. |