Crawling
- Collecting data from the internet so that it can be analyzed and put to use
Scraping
- Crawling + extracting and processing the collected data
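As a minimal sketch of the distinction, the snippet below runs only the scraping step on a fixed HTML string (the `requests.get()` calls in the sections that follow are the crawling step). The markup mirrors the `thrv-columns` structure used in the first example; the HTML itself is made up for illustration.

```python
from bs4 import BeautifulSoup

# A fixed HTML snippet standing in for a crawled page; the div class
# ('thrv-columns') matches the structure scraped in the first example.
html = '<div class="thrv-columns"><a href="/a">Family</a><a href="/b">Restaurant</a></div>'

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', {'class': 'thrv-columns'})
topics = [a.text for a in div.findAll('a')]
print(topics)  # ['Family', 'Restaurant']
```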
❤ Basic English Speaking
import requests
from bs4 import BeautifulSoup
site = 'https://basicenglishspeaking.com/daily-english-conversation-topics/'
request = requests.get(site)
print(request)
# print(request.text)
soup = BeautifulSoup(request.text, 'html.parser')
# print(soup)
divs = soup.find('div', {'class':'thrv-columns'})
print(divs)
links = divs.findAll('a')
print(links)
for link in links:
    print(link.text)
subject = []
for link in links:
    subject.append(link.text)
print(subject)
print(len(subject))  # 75
print('Found', len(subject), 'topics in total')
for i in range(len(subject)):
    print('{0:2d}, {1:s}'.format(i + 1, subject[i]))
❤ Daum news articles
# https://v.daum.net/v/20240520060022001
# https://v.daum.net/v/20240520094402076
# https://v.daum.net/v/20240520052002727
def daum_news_title(news_id):
    url = 'https://v.daum.net/v/{}'.format(news_id)
    request = requests.get(url)
    soup = BeautifulSoup(request.text, 'html.parser')
    title = soup.find('h3', {'class': 'tit_view'})
    if title:
        return title.text.strip()
    return 'No title found'
daum_news_title('20240520094402076')
❤ Bugs Music chart
request = requests.get('https://music.bugs.co.kr/chart')
soup = BeautifulSoup(request.text, 'html.parser')
titles = soup.findAll('p', {'class': 'title'})
# print(titles)
artists = soup.findAll('p', {'class': 'artist'})
# print(artists)
for i, (t, a) in enumerate(zip(titles, artists)):
    title = t.text.strip()
    artist = a.text.strip().split('\n')[0]
    print('{0:3d}. {1:s} - {2:s}'.format(i + 1, title, artist))
❤ Melon
request = requests.get('https://www.melon.com/index.htm')
print(request) # <Response [406]>
- robots.txt: a convention for controlling access to a website by robots such as crawlers (it is only a recommendation, so there is no strict obligation to follow it)
# User-Agent:
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
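The allow/deny rules in a robots.txt file can be checked programmatically with the standard library's `urllib.robotparser`. The rules below are a hypothetical example, not Melon's actual robots.txt (in practice you would fetch it from https://www.melon.com/robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration; the real file may differ.
robots_txt = """\
User-agent: *
Disallow: /chart/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch('*', 'https://www.melon.com/chart/index.htm'))  # False
print(rp.can_fetch('*', 'https://www.melon.com/index.htm'))        # True
```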
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
request = requests.get('https://www.melon.com/chart/index.htm', headers=header)
print(request)
soup = BeautifulSoup(request.text, 'html.parser')
titles = soup.findAll('div', {'class': 'rank01'})
# print(titles)
artists = soup.findAll('span', {'class': 'checkEllipsis'})
# print(artists)
for i, (t, a) in enumerate(zip(titles, artists)):
    title = t.text.strip()
    artist = a.text.strip()
    print('{0:3d}. {1:s} - {2:s}'.format(i + 1, title, artist))
❤ Naver Finance
# https://finance.naver.com/item/main.naver?code=032800
# https://finance.naver.com/item/main.naver?code=053950
# name, price, stock code, trading volume
# {'name': '경남제약', 'price': '1545', 'code': '053950', 'volumn': '1828172'}
site = 'https://finance.naver.com/item/main.naver?code=053950'
request = requests.get(site)
print(request)
soup = BeautifulSoup(request.text, 'html.parser')
div_totalinfo = soup.find('div', {'class': 'new_totalinfo'})
# print(div_totalinfo)
name = div_totalinfo.find('h2').text
print(name)
div_today = div_totalinfo.find('div', {'class': 'today'})
# print(div_today)
price = div_today.find('span', {'class':'blind'}).text
print(price)
div_description = div_totalinfo.find('div', {'class':'description'})
# print(div_description)
code = div_description.find("span", {"class": "code"}).text
print(code)
table_no_info = soup.find('table', {'class': 'no_info'})
# print(table_no_info)
tds = table_no_info.findAll('td')
# print(tds)
volumn = tds[2].find('span', {'class': 'blind'}).text
print(volumn)
dic = {'name': name, 'code': code, 'price': price, 'volumn': volumn}
print(dic)
def naver_finance(code):
    site = f'https://finance.naver.com/item/main.naver?code={code}'
    request = requests.get(site)
    soup = BeautifulSoup(request.text, 'html.parser')
    div_totalinfo = soup.find('div', {'class': 'new_totalinfo'})
    name = div_totalinfo.find('h2').text
    div_today = div_totalinfo.find('div', {'class': 'today'})
    price = div_today.find('span', {'class': 'blind'}).text
    div_description = div_totalinfo.find('div', {'class': 'description'})
    code = div_description.find('span', {'class': 'code'}).text
    table_no_info = soup.find('table', {'class': 'no_info'})
    tds = table_no_info.findAll('td')
    volumn = tds[2].find('span', {'class': 'blind'}).text
    return {'name': name, 'code': code, 'price': price, 'volumn': volumn}
import pandas as pd

codes = ['032800', '053950']  # stock codes from the comments above
data = [naver_finance(code) for code in codes]
df = pd.DataFrame(data)
print(df)
df.to_excel('naver_finance.xlsx')