
Web Crawling, Day 1

Crawling

- Collecting data from the internet so that it can be analyzed and put to use

Scraping

- Crawling + extracting and processing the collected data
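
The extraction step can be tried without a live site. A minimal sketch with an inline HTML string (the tags and links here are made up for illustration):

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page
html = '''
<div class="topics">
  <a href="/family">Family</a>
  <a href="/travel">Travel</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
topics = [a.text for a in soup.find_all('a')]
print(topics)  # ['Family', 'Travel']
```

The same `find` / `find_all` pattern works whether the HTML came from a string or from `requests.get(...).text`.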

 

  Basic English Speaking

 

import requests
from bs4 import BeautifulSoup

site = 'https://basicenglishspeaking.com/daily-english-conversation-topics/'
request = requests.get(site)
print(request)
# print(request.text)
soup = BeautifulSoup(request.text, 'html.parser')  # name the parser to avoid the bs4 warning
# print(soup)
divs = soup.find('div', {'class': 'thrv-columns'})
print(divs)
links = divs.findAll('a')
print(links)
for link in links:
    print(link.text)

 

subject = []

for link in links:
    subject.append(link.text)

subject

len(subject)  # 75

print('Found', len(subject), 'topics in total')
for i in range(len(subject)):
    print('{0:2d}, {1:s}'.format(i + 1, subject[i]))
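
The `{0:2d}` format spec right-aligns the index in a 2-character field, which keeps single-digit and double-digit numbers lined up; `{1:s}` inserts the string as-is. A quick offline check:

```python
# '{0:2d}' pads the number to width 2; '{1:s}' inserts the string unchanged
line = '{0:2d}, {1:s}'.format(3, 'Family')
print(line)  # ' 3, Family' (note the leading space)
```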

  Daum News Articles

# https://v.daum.net/v/20240520060022001
# https://v.daum.net/v/20240520094402076
# https://v.daum.net/v/20240520052002727

def daum_news_title(news_id):
    url = 'https://v.daum.net/v/{}'.format(news_id)
    request = requests.get(url)
    soup = BeautifulSoup(request.text, 'html.parser')
    title = soup.find('h3', {'class': 'tit_view'})
    if title:
        return title.text.strip()
    return 'No title'

daum_news_title('20240520094402076')
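
The function takes a bare article id, and the id is simply the last path segment of a v.daum.net link, so a small (hypothetical) helper can pull it out of a full URL using only the standard library:

```python
from urllib.parse import urlparse

def daum_news_id(url):
    # The article id is the last path segment, e.g. /v/20240520094402076
    return urlparse(url).path.rstrip('/').split('/')[-1]

print(daum_news_id('https://v.daum.net/v/20240520094402076'))
# 20240520094402076
```

Combined with `daum_news_title`, this lets you go straight from a copied link to a title.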

 Bugs Music Chart

request = requests.get('https://music.bugs.co.kr/chart')
soup = BeautifulSoup(request.text, 'html.parser')
titles = soup.findAll('p', {'class': 'title'})
# print(titles)
artists = soup.findAll('p', {'class': 'artist'})
# print(artists)
for i, (t, a) in enumerate(zip(titles, artists)):
    title = t.text.strip()
    artist = a.text.strip().split('\n')[0]  # keep only the first artist line
    print('{0:3d}. {1:s} - {2:s}'.format(i + 1, title, artist))
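
The rank numbering can also come from `enumerate` itself: its `start` argument removes the need for `i + 1`. A sketch with dummy data standing in for the scraped tags:

```python
# Dummy data standing in for the scraped <p class="title"> / <p class="artist"> text
titles = ['Song A', 'Song B', 'Song C']
artists = ['Alpha', 'Beta', 'Gamma']

# enumerate(..., start=1) yields 1-based ranks directly
for rank, (title, artist) in enumerate(zip(titles, artists), start=1):
    print('{0:3d}. {1:s} - {2:s}'.format(rank, title, artist))
```

`zip` also quietly stops at the shorter of the two lists, which is worth remembering if a page has mismatched title and artist counts.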

  Melon

request = requests.get('https://www.melon.com/index.htm')
print(request)  # <Response [406]>

 

- robots.txt: a convention for controlling access to a website by robots such as crawlers (it is only a recommendation, so compliance is not legally required)
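
Python's standard library can evaluate robots.txt rules for you. A sketch using `urllib.robotparser` on an inline rule set (the `Disallow` path here is made up for illustration; normally you would point `set_url` at a site's real robots.txt):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everything under /private/ is off-limits to all agents
rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch('*', 'https://example.com/private/page'))  # False
print(rp.can_fetch('*', 'https://example.com/chart'))         # True
```

Checking `can_fetch` before crawling is a simple way to stay within a site's stated policy, even though the policy itself is only advisory.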

 

# User-Agent:
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
request = requests.get('https://www.melon.com/chart/index.htm', headers=header)
print(request)
soup = BeautifulSoup(request.text, 'html.parser')
titles = soup.findAll('div', {'class': 'rank01'})
# print(titles)
artists = soup.findAll('span', {'class': 'checkEllipsis'})
# print(artists)
for i, (t, a) in enumerate(zip(titles, artists)):
    title = t.text.strip()
    artist = a.text.strip()
    print('{0:3d}. {1:s} - {2:s}'.format(i + 1, title, artist))

  Naver Finance

# https://finance.naver.com/item/main.naver?code=032800
# https://finance.naver.com/item/main.naver?code=053950
# name, price, stock code, trading volume
# {'name': '경남제약', 'price': '1545', 'code': '053950', 'volume': '1828172'}
site = 'https://finance.naver.com/item/main.naver?code=053950'
request = requests.get(site)
print(request)
soup = BeautifulSoup(request.text, 'html.parser')
div_totalinfo = soup.find('div', {'class': 'new_totalinfo'})
# print(div_totalinfo)
name = div_totalinfo.find('h2').text
print(name)

div_today = div_totalinfo.find('div', {'class': 'today'})
# print(div_today)
price = div_today.find('span', {'class': 'blind'}).text
print(price)

div_description = div_totalinfo.find('div', {'class': 'description'})
# print(div_description)
code = div_description.find('span', {'class': 'code'}).text
print(code)

table_no_info = soup.find('table', {'class': 'no_info'})
# print(table_no_info)
tds = table_no_info.findAll('td')
# print(tds)
volume = tds[2].find('span', {'class': 'blind'}).text
print(volume)

dic = {'name': name, 'code': code, 'price': price, 'volume': volume}
dic

def naver_finance(code):
    site = f'https://finance.naver.com/item/main.naver?code={code}'
    request = requests.get(site)
    soup = BeautifulSoup(request.text, 'html.parser')
    div_totalinfo = soup.find('div', {'class': 'new_totalinfo'})
    name = div_totalinfo.find('h2').text
    div_today = div_totalinfo.find('div', {'class': 'today'})
    price = div_today.find('span', {'class': 'blind'}).text
    div_description = div_totalinfo.find('div', {'class': 'description'})
    code = div_description.find('span', {'class': 'code'}).text
    table_no_info = soup.find('table', {'class': 'no_info'})
    tds = table_no_info.findAll('td')
    volume = tds[2].find('span', {'class': 'blind'}).text
    return {'name': name, 'code': code, 'price': price, 'volume': volume}

import pandas as pd

codes = ['032800', '053950']
data = [naver_finance(c) for c in codes]  # one dict per stock code
df = pd.DataFrame(data)
df
df.to_excel('naver_finance.xlsx')  # requires the openpyxl package
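
`pd.DataFrame` turns a list of per-stock dicts into a table with one row per dict and one column per key. A sketch with made-up values in the same shape `naver_finance()` returns:

```python
import pandas as pd

# Made-up rows mimicking naver_finance() output; names, codes, and numbers are dummies
data = [
    {'name': 'Alpha', 'code': '000001', 'price': '1545', 'volume': '1828172'},
    {'name': 'Beta',  'code': '000002', 'price': '2300', 'volume': '910034'},
]

df = pd.DataFrame(data)
print(df.shape)           # (2, 4)
print(list(df.columns))   # ['name', 'code', 'price', 'volume']
```

Column order follows the key order of the dicts, so every row should use the same keys.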
