JunegLee's TIL

overview

Text Preprocessing

  • Text preprocessing is the series of steps that transforms raw input data into a form suited to a natural language processing (NLP) application

  • Feeding raw data to a system without preprocessing degrades the quality of the system

  • Input data must be preprocessed in both the development stage and the operation stage of a system

  • Preprocessing is itself part of NLP => tokenization, lemmatization, etc.

  • Converting natural language text into vectors (embedding)

  • Preprocessing in the machine learning / deep learning / data mining context
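
The steps listed above (tokenization, then converting text into vectors) can be sketched in a few lines. This is only a minimal illustration using the standard library: the function names `tokenize`, `build_vocab`, and `to_vector` are made up for this note, the tokenizer only handles lowercase ASCII words, and the vector is a simple bag-of-words count rather than a learned embedding. Lemmatization is left out because it needs a dictionary or a trained model.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(token_lists):
    """Assign an index to every distinct token, in sorted order."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}

def to_vector(tokens, vocab):
    """Build a count-based (bag-of-words) vector for one document."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]

docs = ["The cat sat.", "The cat saw the dog!"]
tokens = [tokenize(d) for d in docs]
vocab = build_vocab(tokens)           # cat, dog, sat, saw, the
vectors = [to_vector(t, vocab) for t in tokens]

print(tokens[0])   # ['the', 'cat', 'sat']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Skipping `tokenize` and counting raw strings would treat "The" and "the" or "dog!" and "dog" as different items, which is one concrete way raw data lowers quality.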

Natural Language Processing

  • Software processing that collects, refines, analyzes, and transforms natural language data, as opposed to artificial language data

Natural Language

  • Languages that people use, such as Korean, English, and Chinese

  • A system of symbols that a language community developed naturally in order to exchange messages

  • It has rules, but a message can still get across when they are not followed strictly

  • Ambiguity (more than one possible reading) can arise

  • Spoken language, written language

Formal Language

  • A language whose use is restricted to a specific field

  • Formal grammar: the set of symbols and rules used to generate strings

  • Used in mathematics, linguistics, and computer science, ex) programming languages, mathematical expressions, chemical formulas

  • Strict grammar
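
To make "a set of symbols and rules used to generate strings" concrete, here is a small sketch. The grammar `S -> "a" S "b" | "ab"` is a toy chosen for this note (it is not from the original material), and `generate` and `accepts` are hypothetical helper names. The generator expands the leftmost nonterminal until only terminal symbols remain, within a fixed expansion budget.

```python
# Toy grammar (an assumption for illustration): S -> "a" S "b" | "ab"
RULES = {"S": [["a", "S", "b"], ["a", "b"]]}

def generate(symbols, depth):
    """Yield every terminal string derivable from `symbols` in at most `depth` expansions."""
    for i, sym in enumerate(symbols):
        if sym in RULES:                      # leftmost nonterminal found
            if depth == 0:
                return                        # budget used up, drop this derivation
            for rhs in RULES[sym]:
                yield from generate(symbols[:i] + rhs + symbols[i + 1:], depth - 1)
            return
    yield "".join(symbols)                    # no nonterminals left

def accepts(s):
    """Strict membership test: only strings of the form a^n b^n (n >= 1) belong."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n

strings = sorted(generate(["S"], 3), key=len)
print(strings)          # ['ab', 'aabb', 'aaabbb']
print(accepts("aabb"))  # True
print(accepts("aab"))   # False
```

Unlike natural language, where a slightly malformed sentence is still understood, `accepts` rejects anything that deviates from the rules at all.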

Understanding Natural Language Processing

  • Conveying meaning through language is possible only when the speakers hold a shared concept in mind. ex) Salmiakki (a candy from abroad) is unfamiliar to most people in South Korea, so the word may be hard for them to understand

  • Language is a system made up of development, acquisition, and maintenance, and it is the use of a complex system of communication.

Main Tasks of Natural Language Processing

  • NLP tasks range from token processing to machine translation and dialogue systems

  • Preprocessing also counts as an NLP task

  • NLP applications contain a variety of preprocessing modules.
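
One way to picture "a variety of preprocessing modules" inside an application is as a chain of small functions applied in order. The sketch below is an assumption about how such a chain might look, not a description of any particular library; every function name and the `PIPELINE` list are invented for this note.

```python
import re
from functools import reduce

def normalize_whitespace(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

def split_tokens(text):
    return text.split()

# Each module takes the previous module's output as its input.
PIPELINE = [normalize_whitespace, lowercase, strip_punctuation, split_tokens]

def preprocess(text, steps=PIPELINE):
    """Run the text through each step in order."""
    return reduce(lambda value, step: step(value), steps, text)

result = preprocess("  Hello,   NLP  world! ")
print(result)  # ['hello', 'nlp', 'world']
```

Keeping each module separate makes it easy to reuse the same chain in both development and operation, or to swap one step without touching the others.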


Last updated 3 years ago
