python3 - Crawler (cloudscraper)

sniffer-k 2023. 6. 3. 18:06

When crawling with Python, the "requests" module is the usual way to fetch a page's data.

import requests

url = 'https://kr.investing.com/commodities/natural-gas'

headers = {'User-Agent' : 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.4587.173 Mobile Safari/537.36'}

html_1 = requests.get(url, headers=headers)  # request with a custom User-Agent header
html_2 = requests.get(url)  # request without custom headers

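With the requests module you can read the data of most ordinary websites. However, sites with strong security, or sites that publish product information such as stock prices, block indiscriminate data collection by crawlers that are not real browser users.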

(For example, investing.com: requests.get() cannot retrieve the data.)
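One way to notice that a plain requests call was blocked is to inspect the response instead of trusting that the body is the real page. The sketch below is only a heuristic under stated assumptions: the function name `looks_blocked`, the status codes, and the marker strings are illustrative choices, not an official Cloudflare or requests API, and a real block page may look different.

```python
def looks_blocked(status_code, headers, body_text):
    """Heuristic guess (assumption) that a response is an anti-bot block page."""
    # Normalize header names so a plain dict works as well as requests' headers.
    normalized = {key.lower(): value for key, value in headers.items()}
    server = normalized.get("server", "").lower()

    # Block and challenge pages are commonly served with one of these codes.
    blocked_status = status_code in (403, 429, 503)

    # Illustrative marker strings; not an exhaustive or guaranteed list.
    markers = ("just a moment", "attention required", "cf-chl")
    body = body_text.lower()
    has_marker = any(marker in body for marker in markers)

    return blocked_status and ("cloudflare" in server or has_marker)


# Usage with requests (needs network access):
# resp = requests.get(url, headers=headers)
# if looks_blocked(resp.status_code, resp.headers, resp.text):
#     print("The request appears to have been blocked")
```

Checking the status code together with the body keeps a normal 200 page served through Cloudflare from being flagged.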

To prevent crawling, web servers typically rely on products such as Cloudflare.

https://www.cloudflare.com/ko-kr/products/bot-management/

Is there a way to get past Cloudflare?

There is.

It is "cloudscraper".

https://pypi.org/project/cloudscraper/

https://github.com/VeNoMouS/cloudscraper

pip install cloudscraper

Installation is as simple as running the command above. (Versions from late Python 2.x through Python 3.x are supported.)

The basic usage looks like this:

import cloudscraper

# Create a scraper that presents itself as the Chrome browser
scraper = cloudscraper.create_scraper(browser='chrome')
html = scraper.get("https://kr.investing.com/commodities/natural-gas-historical-data").content

# Create a scraper that presents itself as Chrome on Android
scraper_android = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'android',
        'desktop': False
    }
)

html = scraper_android.get("https://kr.investing.com/commodities/natural-gas-historical-data").content

# Get the cookie string and User-Agent for a specific page
cookie_arg, user_agent = cloudscraper.get_cookie_string('http://www.google.com')
print(cookie_arg)
print(user_agent)
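The calls above read `.content`, which is raw bytes, and they do not check whether the request actually succeeded. Since a scraper created by `create_scraper()` behaves like a requests session, a small wrapper can add a timeout and a status check. This is a minimal sketch: the helper name `fetch_html` and the default timeout of 10 seconds are assumptions made for illustration, not part of cloudscraper.

```python
def fetch_html(scraper, url, timeout=10):
    """Fetch url with a session-like object and return the decoded HTML text.

    `scraper` is anything with a requests-style get() method, for example
    the object returned by cloudscraper.create_scraper().
    """
    response = scraper.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx rather than returning an error page.
    response.raise_for_status()
    # .text is the body decoded to str; .content would be raw bytes.
    return response.text


# Usage (needs network access and cloudscraper installed):
# import cloudscraper
# scraper = cloudscraper.create_scraper(browser='chrome')
# html = fetch_html(scraper, "https://kr.investing.com/commodities/natural-gas-historical-data")
```

Passing the scraper in as an argument lets the same helper work with a plain `requests.Session()` as well, which makes it easy to compare the two approaches on the same URL.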
