Python - Reading HTML Pages

BeautifulSoup is a Python library for parsing HTML. With it, we can look up the values of HTML tags and pull out specific data, such as the page title and the list of headers on the page.

Installing BeautifulSoup

Use the Anaconda package manager to install the required package and its dependencies.

conda install beautifulsoup4

Reading the HTML File

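In the example below, we request a URL and load its contents into the Python environment. We then parse the entire HTML file with the 'html.parser' parser argument. Finally, we print the first 225 characters of the formatted HTML page.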

from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

# Fetch the html file
response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Format the parsed html file
strhtm = soup.prettify()

# Print the first few characters
print(strhtm[:225])

When we execute the above code, it produces the following result.

<!DOCTYPE html>
<!--[if IE 8]><html class="ie ie8"> <![endif]-->
<!--[if IE 9]><html class="ie ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<html>
 <!--<![endif]-->
 <head>
  <!-- Basic -->
  <meta charset="utf-8"/>
  <title>
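
The same parse-and-format steps can be tried without a network connection by passing a string to BeautifulSoup instead of a downloaded page. The following is a minimal sketch; the inline document and the variable name pretty are assumptions made for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used here in place of a live URL.
html_doc = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the parsed tree as an indented, multi-line string.
pretty = soup.prettify()
print(pretty)
```

Because BeautifulSoup accepts both strings and bytes, the object returned by response.read() in the example above can be passed in the same way.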

Extracting Tag Values

We can extract the value of the first instance of a tag using the following code.

from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()

soup = BeautifulSoup(html_doc, 'html.parser')

# soup.<tag> returns the first matching tag; .string returns its text
print(soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)

When we execute the above code, it produces the following result.

<title>Python Overview</title>
Python Overview
None
Python is Interpreted
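
The None in the output above is worth a closer look. The .string attribute returns text only when a tag has a single child; if a tag contains several children, it returns None, which is presumably what happens with the first link on that page. The sketch below illustrates this on a small inline document. The markup and the names link and bold are hypothetical, chosen only for the demonstration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> tag has two children (a <span> and a text node),
# while the <b> tag has exactly one text child.
html_doc = '<p><a href="/home"><span>Go</span> home</a> <b>Bold text</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.a  # first <a> tag in the document
bold = soup.b  # first <b> tag in the document

print(bold.string)      # single text child, so .string returns it
print(link.string)      # more than one child, so .string is None
print(link.get_text())  # get_text() joins all text found inside the tag
print(link["href"])     # attributes are read with dictionary-style access
```

When the text of a tag is needed regardless of its nesting, get_text() is usually the safer choice.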

Extracting All Tags

We can extract the values of all instances of a tag using the following code.

from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

# find_all() returns every <b> tag in document order
for x in soup.find_all('b'):
    print(x.string)

When we execute the above code, it produces the following result.

Python is Interpreted
Python is Interactive
Python is Object-Oriented
Python is a Beginner's Language
Easy-to-learn
Easy-to-read
Easy-to-maintain
A broad standard library
Interactive Mode
Portable
Extendable
Databases
GUI Programming
Scalable
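
The introduction mentions collecting the list of headers in a page. One way to do that, sketched here under the assumption that the headers of interest are h1 and h2 tags, is to pass a list of tag names to find_all(). The inline document and the variable name headers are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page with one h1 and two h2 headers.
html_doc = """
<html><body>
<h1>Overview</h1>
<p>Intro with <b>bold</b> text</p>
<h2>History</h2>
<h2>Features</h2>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# A list of names matches any of those tags; results come back in document order.
headers = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]
print(headers)
```

The same pattern should work for any other set of tags, for example ["h1", "h2", "h3"], by changing the list passed to find_all().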