
์›น ํฌ๋กค๋ง

์›น ํฌ๋กค๋ง

์›น ์ƒ์—์„œ ์›น ํŽ˜์ด์ง€๋ฅผ ํƒ์ƒ‰ํ•˜๊ณ , ํ•„์š”ํ•œ ์ •๋ณด๋ฅผ ์ˆ˜์ง‘ํ•˜๋Š” ํ”„๋กœ์„ธ์Šค
โ€ข
ํŒŒ์ด์ฌ ๊ด€๋ จ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ : urllib, requests
๊ด€๋ จ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ด์„œ, ์›น ํŽ˜์ด์ง€ ์ฝ”๋“œ๋ฅผ ๊ฐ€์ ธ์˜ค๊ณ  ๊ทธ ์ฝ”๋“œ๋ฅผ ๊ตฌ๋ฌธ ๋ถ„์„ํ•˜์—ฌ ํ•„์š”ํ•œ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.

์›น ์Šคํฌ๋ž˜ํ•‘

์›น ํŽ˜์ด์ง€์—์„œ ์›ํ•˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•˜๊ฑฐ๋‚˜ ๊ฐ€์ ธ์˜ค๋Š” ๊ณผ์ •
โ€ข
ํŒŒ์ด์ฌ ๊ด€๋ จ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ : BeatifulSoup

urllib

Python standard library for working with URLs.
โ€ข Modules (usage of parse and robotparser is sketched below)

    request     : request-related functionality
    response    : response-related functionality
    parse       : functions for parsing URL strings
    error       : exception classes for errors raised by the request module
    robotparser : functions for parsing robots.txt files
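
The parse and robotparser modules listed above are not used elsewhere in these notes; a minimal sketch of how they might be called (the query URL is a made-up example, and the robots.txt check simply reuses the Naver address requested later):

from urllib.parse import urlparse, parse_qs
from urllib.robotparser import RobotFileParser

# Split a URL string into its components
parts = urlparse("https://example.com/search?query=python&page=2")
print(parts.scheme, parts.netloc, parts.path)   # https example.com /search
print(parse_qs(parts.query))                    # {'query': ['python'], 'page': ['2']}

# Parse a site's robots.txt before crawling it
rp = RobotFileParser()
rp.set_url("https://www.naver.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.naver.com/"))   # True/False depending on the site's rules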

The urllib Package

The urllib.request Module

Performs an HTTP request on a URL string.
โ€ข Sending a request to a given URL with the GET or POST method
import urllib.request

res = urllib.request.urlopen(" [URL] ")            # GET request
res = urllib.request.urlopen(" [URL] ", data=xxx)  # POST request
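
For a POST request, the data argument must be a bytes object, typically built with urllib.parse.urlencode. A minimal sketch, assuming the public echo endpoint https://httpbin.org/post (not part of the original notes):

import urllib.request
import urllib.parse

# Encode form fields as application/x-www-form-urlencoded bytes
form = urllib.parse.urlencode({'id': 'guest', 'page': 1}).encode('utf-8')

# Passing data= makes urlopen() issue a POST request
res = urllib.request.urlopen("https://httpbin.org/post", data=form)
print(res.status)
print(res.read().decode('utf-8'))
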
โ€ข Sending a request with a Request object

import urllib.request

url = "https://~~~.com"                                  # URL to send the request to
request = urllib.request.Request(url, method="METHOD")   # GET, POST, PUT, DELETE, etc.
response = urllib.request.urlopen(request)               # send the request
result = response.read()                                 # read the response data
print(result)                                            # print the response data
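
Some servers reject requests that arrive with the default urllib user agent, so the Request constructor also accepts a headers dictionary. A minimal sketch (the URL and User-Agent string are placeholders, not from the original notes):

import urllib.request

url = "https://example.com/"
headers = {'User-Agent': 'Mozilla/5.0'}   # pose as a regular browser

request = urllib.request.Request(url, headers=headers, method="GET")
response = urllib.request.urlopen(request)
print(response.status)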

Example Code

โ€ข Request the Naver main page and print only the first 1000 bytes
โ€ข Send a request to a personal homepage and inspect the response body and headers
โ€ข Request an image URL and save the image to a file

๋„ค์ด๋ฒ„ ๋ฉ”์ธ ํŽ˜์ด์ง€๋ฅผ ์š”์ฒญํ•˜๊ณ , 1000 byte ๋งŒ ์ถœ๋ ฅํ•˜๊ธฐ

import urllib.request

res = urllib.request.urlopen("http://www.naver.com/")
print(type(res))
print(res.status)
print("Source of the NAVER web page----------------------------------------------------------------")
print(res.read(1000).decode('utf-8'))
โ€ข res.status : HTTP response status code
โ€ข res.read() : reads the response body as raw bytes, without decoding
โ€ข res.read().decode('utf-8') : decodes the response body using the UTF-8 character set
โ€ข res.read(1000).decode('utf-8') : reads only the first 1000 bytes of the response body and decodes them
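
Hard-coding 'utf-8' only works when the page is actually served in UTF-8. The response also carries its headers, so the character set can be read from them instead; a minimal sketch that falls back to UTF-8 when no charset is declared:

import urllib.request

res = urllib.request.urlopen("http://www.naver.com/")

# res.headers is an email.message.Message; get_content_charset() returns the
# charset declared in the Content-Type header, or None if there is none
charset = res.headers.get_content_charset() or 'utf-8'
print(res.read(1000).decode(charset))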

๊ฐœ์ธ ํ™ˆํŽ˜์ด์ง€์— ์š”์ฒญ์„ ๋ณด๋‚ด๊ณ , ์‘๋‹ต ๋‚ด์šฉ๊ณผ ์‘๋‹ต ํ—ค๋” ํ™•์ธํ•˜๊ธฐ

import urllib.request

res = urllib.request.urlopen("https://xn--pe5b27r.com/")

print("[ header info ]----------")
res_header = res.getheaders()
for s in res_header:
    print(s)

print("[ body content ]-----------")
print(res.read().decode('utf-8'))

Requesting an image URL and saving the image to a file

import requests
from PIL import Image
from io import BytesIO

r = requests.get('image URL')
i = Image.open(BytesIO(r.content))
print(type(i))
i.save("./filename.jpg")
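
When the bytes only need to be written to disk, the same download can be done with urllib alone, without Pillow. A minimal sketch using the legacy urllib.request.urlretrieve helper (the URL and file name are placeholders, as above):

import urllib.request

# Download the resource at the URL straight into a local file
urllib.request.urlretrieve('image URL', './filename.jpg')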

Opening the saved image

from PIL import Image

img = Image.open("./filename.jpg")   # path of the saved image
img.show()

BeautifulSoup

A Python library for extracting data from HTML and XML (markup) documents, i.e. web page source code.
It extracts the required information from the page source.
The content of the HTML or XML document must first be read in and converted.
Parsing
: the process of interpreting and analyzing text according to a specific grammar or format, using a parsing tool, and converting it into a usable structure.

HTML Parsing Procedure

1. Import the BeautifulSoup module
2. Create a BeautifulSoup object
    a. Argument 1 : the HTML code
    b. Argument 2 : the parser to use (e.g. 'html.parser')
3. An object wrapping the HTML code is created, and the HTML can be accessed through BeautifulSoup.

1. Import the BeautifulSoup module

from bs4 import BeautifulSoup

2. Create a BeautifulSoup object
    a. Argument 1 : the HTML code
    b. Argument 2 : the parser to use

bs = BeautifulSoup(html, 'html.parser')

3. An object wrapping the HTML code is created, and the HTML can now be accessed through the BeautifulSoup object.

Accessing Tags

bs.tag_name
โ€ข Examples
    โ—ฆ bs.div
    โ—ฆ bs.h1
    โ—ฆ bs.p
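
A small sketch of this dotted access on a made-up HTML string; bs.tag_name returns the first occurrence of that tag in the document:

from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>first</p><p>second</p></body></html>"
bs = BeautifulSoup(html, 'html.parser')

print(bs.h1)   # <h1>Title</h1>
print(bs.p)    # <p>first</p>  (only the first <p> is returned)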

Extracting the Tag Name

bs.tag_name.name

ํƒœ๊ทธ ์†์„ฑ ์ถ”์ถœ

bs.ํƒœ๊ทธ๋ช…['์†์„ฑ๋ช…'] bs.ํƒœ๊ทธ๋ช….attrs
Python
๋ณต์‚ฌ

ํƒœ๊ทธ ์ปจํ…์ธ  ์ถ”์ถœ

bs.tag_name.string
bs.tag_name.text
bs.tag_name.contents
bs.tag_name.get_text()
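
These accessors behave differently when a tag has mixed content (text plus child tags); a small sketch on a made-up fragment:

from bs4 import BeautifulSoup

html = "<div>hello <b>world</b></div>"
bs = BeautifulSoup(html, 'html.parser')

print(bs.div.string)      # None  (.string is None when there is more than one child)
print(bs.div.text)        # hello world
print(bs.div.contents)    # ['hello ', <b>world</b>]
print(bs.div.get_text())  # hello world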

Parent Tag

bs.tag_name.parent

์ž์‹ ํƒœ๊ทธ

bs.tag_name.children

Sibling Tags

bs.tag_name.next_sibling
bs.tag_name.next_siblings
bs.tag_name.previous_sibling
bs.tag_name.previous_siblings

์ž์† ํƒœ๊ทธ

bs.tag_name.descendants
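
A minimal sketch tying the navigation attributes together, using a made-up HTML fragment (the id and list items are invented for illustration):

from bs4 import BeautifulSoup

html = """
<ul id="menu">
  <li>Home</li>
  <li>News</li>
  <li>Contact</li>
</ul>
"""
bs = BeautifulSoup(html, 'html.parser')

li = bs.li
print(li.parent['id'])            # menu
print(repr(li.next_sibling))      # whitespace text node between the <li> tags
print(list(bs.ul.children))       # <li> tags mixed with whitespace text nodes
print([d.name for d in bs.ul.descendants if d.name])   # ['li', 'li', 'li']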

์›น ์Šคํฌ๋ž˜ํ•‘ ์‹ค์Šต

โ€ข Import the BeautifulSoup module and create a BeautifulSoup object
โ€ข Access tags, tag names, and attribute values

Fetching a Movie Title

# ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์„ค์น˜ # ํ„ฐ๋ฏธ๋„ > # pip install requests # pip install beautifulsoup4 # pip install lxml import requests from bs4 import BeautifulSoup # ํŠน์ • ์‚ฌ์ดํŠธ์˜ html ๊ฐ€์ ธ์˜ค๊ธฐ url = "https://movie.naver.com/movie/bi/mi/basic.naver?code=74977" html = requests.get(url) # print(html) # html ๋ถ„์„ soup = BeautifulSoup(html.text, 'lxml') # ์˜ํ™” ์ œ๋ชฉ h3 = soup.find('h3', class_='h_movie') a = h3.find('a') text = a.get_text() # print(h3) # print(a) print(text)
Python
๋ณต์‚ฌ
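
The class names in the snippet above are tied to the page's markup at the time it was written; if the structure changes, find() returns None and h3.find('a') raises an AttributeError. A small defensive variant of the same lookup (same URL and selectors as above):

import requests
from bs4 import BeautifulSoup

url = "https://movie.naver.com/movie/bi/mi/basic.naver?code=74977"
soup = BeautifulSoup(requests.get(url).text, 'lxml')

h3 = soup.find('h3', class_='h_movie')
a = h3.find('a') if h3 is not None else None
if a is not None:
    print(a.get_text())
else:
    print("movie title element not found - the page structure may have changed")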

Fetching News Articles

# ์›น ํฌ๋กค๋ง # pip install requests # pip install beautifulsoup4 # pip install lxml # import : ๋ชจ๋“ˆ, ํŒจํ‚ค์ง€๋ฅผ ํฌํ•จํ•˜๋Š” ํ‚ค์›Œ๋“œ import requests from bs4 import BeautifulSoup # 'User-Agent' ํ—ค๋” ์ถ”๊ฐ€ (์‚ฌ์šฉ์ž ์ •๋ณด) # http://m.avalon.co.kr/check.html <- ์—ฌ๊ธฐ์„œ ๋ณต์‚ฌ headers = { 'User-Agent' : ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'), } url = "https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=105" html = requests.get(url, headers=headers) print(html) # html ๋ถ„์„ soup = BeautifulSoup(html.text, 'lxml') # # ์„ ํƒ์ž๋กœ ์ง€์ •ํ•ด์„œ ํƒœ๊ทธ ๊ฐ€์ ธ์˜ค๊ธฐ newsList = soup.select('.sh_item') # ๋‰ด์Šค ์ œ๋ชฉ # ๊ธฐ์‚ฌ๋‚ด์šฉ # ์‹ ๋ฌธ์‚ฌ # -------------- # ์ œ๋ชฉ (์‹ ๋ฌธ์‚ฌ) # : ๊ธฐ์‚ฌ๋‚ด์šฉ # -------------- for news in newsList: title = news.select('.sh_text_headline')[0].get_text() company = news.select('.sh_text_press')[0].get_text() content = news.select('.sh_text_lede')[0].get_text() print('----------------------------') print('{} ({})'.format(title, company)) print(' : {}'.format(content)) print('----------------------------') # ์œ„์˜ ํ˜•์‹์œผ๋กœ ์ถœ๋ ฅํ•ด๋ณด์„ธ์š”... # print(news)
Python
๋ณต์‚ฌ
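
A side note on the select() calls above: select('...')[0] raises an IndexError when nothing matches, while select_one() returns the first match or None, which is easier to guard. A minimal sketch on a made-up fragment that reuses the same class names:

from bs4 import BeautifulSoup

html = '<div class="sh_item"><a class="sh_text_headline">Some headline</a></div>'
soup = BeautifulSoup(html, 'html.parser')

item = soup.select_one('.sh_item')
headline = item.select_one('.sh_text_headline')   # first match, or None
press = item.select_one('.sh_text_press')         # no such element -> None
print(headline.get_text())                        # Some headline
print(press)                                      # None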