´ëÇѾð¾îÇÐȸ ÀüÀÚÀú³Î

´ëÇѾð¾îÇÐȸ

Vol. 25, No. 4 (December 2017)

A Study on a Preprocessor and Dictionary Expansion for Handling Unanalyzed Tokens in Social Media Texts

Seong-Yong Choi, Dong-Hyok Shin, Jeesun Nam

Pages : 193-226

DOI : https://doi.org/10.24303/lakdoi.2017.25.4.193

Abstract

Choi, Seong-Yong, Shin, Dong-Hyok & Nam, Jeesun. (2017). A methodology for building linguistic resources that recognize unanalyzed sequences in social media texts. The Linguistic Association of Korea Journal, 25(4), 193-226. This study analyzes the linguistic problems posed by unanalyzed tokens in Social Media (SM) texts and proposes methodologies for handling them effectively. As the number of SM users has grown, the need to analyze such texts has increased significantly. However, unanalyzed tokens severely hamper the overall performance of processing SM textual data. This study proposes two methodologies: 1) a normalization process using a preprocessing module named Preprocessing Grammar Table (PGT), which corrects frequent unanalyzed sequences such as orthographic errors and spacing errors; 2) a lexicon-based method using the DECO dictionary and Local Grammar Graphs (LGG). When PGT and an enhanced DECO dictionary are applied to SM texts, preprocessing performance improves considerably, with 87% of the unanalyzed tokens removed, which demonstrates the practical value of the approach.
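
The two-step idea in the abstract (table-driven normalization followed by lexicon lookup) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the rewrite table, the toy lexicon, and the function names `normalize` and `unanalyzed` are hypothetical and do not reflect the actual PGT format, the DECO dictionary, or the authors' implementation. English toy data is used in place of Korean.

```python
import re

# Hypothetical toy lexicon standing in for a dictionary such as DECO.
LEXICON = {"i", "like", "this", "phone", "a", "lot", "thanks", "really"}

# Hypothetical rewrite table standing in for a preprocessing grammar table:
# each row maps a frequent nonstandard sequence to its normalized form.
REWRITE_TABLE = [
    (re.compile(r"\balot\b"), "a lot"),      # spacing error
    (re.compile(r"\bthx\b"), "thanks"),      # nonstandard spelling
    (re.compile(r"\brea+l+y\b"), "really"),  # letter repetition
]


def normalize(text):
    """Lowercase the text and apply every rewrite rule in table order."""
    text = text.lower()
    for pattern, replacement in REWRITE_TABLE:
        text = pattern.sub(replacement, text)
    return text


def unanalyzed(text):
    """Return the whitespace-separated tokens that the lexicon does not cover."""
    return [token for token in text.split() if token not in LEXICON]


raw = "thx i reaaally like this phone alot"
before = unanalyzed(raw.lower())     # tokens the lexicon misses in the raw text
after = unanalyzed(normalize(raw))   # tokens still missed after normalization
print(before, after)
```

In this toy setup, comparing `before` and `after` shows how many unanalyzed tokens the normalization step removes. The sketch only conveys the shape of such a measurement and says nothing about the 87% figure reported in the abstract.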

Keywords

# preprocessing # social media # unanalyzed tokens # Preprocessing Grammar Table (PGT) # DECO dictionary # local grammar graph
