How do I scrape two tables and write them to one CSV?
I am trying to scrape two tables from this website: https://www.nsw.gov.au/covid-19/latest-news-and-updates
At this stage I am stuck just getting any initial output. The scraper returns no errors, so I can't see what the problem is.
Ideally I would like to combine both tables into one, with an extra Action column whose value is taken from each table's title (see the example below).
This is the code I have tried to use:
from bs4 import BeautifulSoup
from requests import get
from csv import writer

url = 'https://www.nsw.gov.au/covid-19/latest-news-and-updates'
r = get(url)
soup = BeautifulSoup(r.text, 'lxml')
tables = soup.find_all('nsw-table-responsive')

for num, table in enumerate(tables, start=1):
    filename = 'covidstatus.csv' % num
    with open(filename, 'w') as f:
        data = []
        csv_writer = writer(f)
        rows = table.find_all('tr')
        for row in rows:
            headers = row.find_all('th')
            if headers:
                csv_writer.writerow([header.text.strip() for header in headers])
            columns = row.find_all('td')
            csv_writer.writerow([column.text.strip() for column in columns])
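(A note on why this produces no output and no error: find_all('nsw-table-responsive') looks for a tag named nsw-table-responsive, which does not exist on the page, so tables comes back as an empty list and the loop body never runs. Assuming nsw-table-responsive is actually a CSS class, which the name suggests but is not verified here, the lookup would be:

# assumption: 'nsw-table-responsive' is a class attribute on the <table>
# elements rather than a tag name; find_all matches tag names by default
tables = soup.find_all('table', class_='nsw-table-responsive')

The answers below sidestep this by selecting rows or tables directly.)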
Below is an example of my ideal output:
Location,Dates,Action
Glebe: Jambo Jambo African Restaurant,7pm to 10:30pm on Friday 31 July 2020,Self-isolate and get tested immediately
Hamilton: Bennett Hotel,5:30pm to 10pm on Friday 31 July,Self-isolate and get tested immediately
Bankstown: BBQ City Buffet,7pm to 8.30pm on Saturday 1 August,Monitor for symptoms
Broadmeadow: McDonald Jones Stadium,7:30pm to the end of the Newcastle Jets match on Sunday 2 August,Monitor for symptoms
Any help anyone can provide with this would be much appreciated.
Answers
AndrejKesely
This script saves the data to data.csv:
import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.nsw.gov.au/covid-19/latest-news-and-updates'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for row in soup.select('tr:has(td)'):
    all_data.append(
        [td.get_text(strip=True, separator='\n') for td in row.select('td')]
    )
    # add the nearest preceding <h4> (the table's title) as an extra column
    all_data[-1].append(row.find_previous('h4').text)
    # keep the Location cell on a single line
    all_data[-1][0] = all_data[-1][0].replace('\n', '')

with open('data.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        csv_writer.writerow(row)
Screenshot of data.csv in LibreOffice (screenshot omitted).
EDIT (to also write a header row):
...
with open('data.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(['Location', 'Dates', 'Type'])
    for row in all_data:
        csv_writer.writerow(row)
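One more detail, visible in the raw output quoted in the next answer: the scraped cells contain non-breaking spaces (\xa0) and embedded tabs and newlines. A small cleanup helper, added here rather than part of the original answer, can normalise each cell before writing:

import re

def clean(text):
    # collapse runs of whitespace (including \xa0, tabs, newlines) to one space
    return re.sub(r'\s+', ' ', text).strip()

all_data = [[clean(cell) for cell in row] for row in all_data]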
AssadAli
Here is working code. Let me know if you have any questions.
from bs4 import BeautifulSoup
from requests import get
from csv import writer

url = 'https://www.nsw.gov.au/covid-19/latest-news-and-updates'
r = get(url)
soup = BeautifulSoup(r.text, 'lxml')
tables = soup.find_all('table')

for num, table in enumerate(tables, start=1):
    filename = 'covidstatus.csv'
    with open(filename, 'w') as f:
        data = []
        csv_writer = writer(f)
        rows = table.find_all('tr')
        for row in rows:
            headers = row.find_all('th')
            if headers:
                head = [header.text.strip() for header in headers]
                print(head)
                csv_writer.writerow([header.text.strip() for header in headers])
            columns = row.find_all('td')
            print([column.text.strip() for column in columns])
            csv_writer.writerow([column.text.strip() for column in columns])
Here is the output:
['Location', 'Dates']
[]
['Hamilton: Sydney Junction Hotel', '11pm on Saturday 1 August to 1:15am on Sunday 2 August']
['Huskisson: Wildginger', '7:45pm to 10:30pm on Saturday 8 August']
['Lidcombe: Dooleys Lidcombe Catholic Club', '5pm on Friday 7 August to 6:30am on Saturday 8 August\xa0\n\t\t\t4:30pm to 11:30pm on Saturday 8 August\n\t\t\t1pm to 9pm on Sunday 9 August\n\t\t\t12pm to 9:30pm on Monday 10 August\xa0\nIf you were at this venue for at least 1 hour during any of these
times, you must self-isolate and get tested and stay isolated for 14 days after your last day at the venue within these dates. (Advice updated 16\xa0August)']
['Mollymook: Rick Stein at Bannisters', '8pm to 10:30pm on Saturday 1 August for at least one hour\nSelf-isolate until midnight 15 August or until you have received a negative result, whichever is later.']
['New Lambton: Bar 88 - Wests New Lambton', '5pm to 7:15pm on Sunday 2 August']
['Newcastle: Hamilton to Adamstown Number 26 bus', '8:20am on Monday 3 August']
['Location', 'Dates']
[]
[]
['Bowral:\xa0Horderns Restaurant at Milton Park Country House Hotel and Spa', '7:45pm to 9:15pm on\xa0Sunday 2 August']
['Broadmeadow: McDonald Jones Stadium', '7:30pm to the end of the Newcastle Jets match on Sunday 2 August']
['Campbelltown: Bunnings Warehouse', '11am to 7pm on Tuesday 4 August\xa0\n\t\t\t8am to 4pm on Wednesday 5 August\n\t\t\t1pm to 3pm on Thursday 6 August']
['Castle Hill:\xa0Castle Towers Shopping Centre', '3:30pm to 5pm on Friday\xa07 August']
['Cherrybrook:\xa0PharmaSave Cherrybrook Pharmacy in Appletree Shopping Centre', '4pm to 7pm on Thursday 6 August']
['Concord:\xa0Crust Pizza', '4pm to\xa08pm on\xa0Thursday 6 August\n\t\t\t5pm to 9pm on\xa0Friday 7 August']
['Double Bay:\xa0Café Perons', '1pm to 2pm on\xa0Saturday 8 August']
['Liverpool:\xa0Liverpool Hospital', '7am to 3pm on Thursday 6 August\n\t\t\t7am to 3pm on Friday 7 August\n\t\t\t5am to 1:30pm on Saturday 8 August\n\t\t\t5am to 1:30pm on Sunday 9 August']
['Liverpool: Westfield Liverpool', '10:30am to 11am and 12:30pm to 1pm on Friday 7 August']
['Marrickville: Woolworths -\xa0Marrickville Metro Shopping Centre', '7pm to 7:20pm on Sunday 2 August']
['Parramatta: Westfield Parramatta', '4pm to 5:30pm on Wednesday\xa05 August\n\t\t\t12pm to 1pm on Saturday 8 August']
['Pennant Hills: St Agatha's', '6:30 am to 7am on\xa0Wednesday 5 August\n\t\t\t6:30 am to 7am on Thursday 6 August']
['Penrith: Baby Bunting', '1:15pm to 1:45pm on Saturday 8 August']
['Rhodes: IKEA', '1:20pm to 2:20pm on Saturday 8 August']
['Rose Bay:\xa0Den Sushi', '7:15pm to 8:45pm on\xa0Saturday 8 August']
['Smithfield:\xa0Chopstix Asian Cuisine, Smithfield RSL', 'Friday 31 July to Saturday 9 August']
['Wetherill Park: 5th Avenue Beauty Bar', '2pm to 3pm\xa0on Saturday 8 August']
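Two cautions about this version, noted here as additions to the original answer: the file is reopened in 'w' mode on every pass of the outer loop, so the second table's rows overwrite the first's, and opening a CSV without newline='' can produce blank rows on Windows. A sketch that keeps a single file open across both tables:

# reuses `tables` and `writer` from the code above
with open('covidstatus.csv', 'w', newline='') as f:
    csv_writer = writer(f)
    csv_writer.writerow(['Location', 'Dates'])
    for table in tables:
        for row in table.find_all('tr'):
            columns = row.find_all('td')
            if columns:  # skip the header row and empty rows
                csv_writer.writerow([column.text.strip() for column in columns])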
PraysonW.Daniel
The easiest way is to use .read_html from Pandas; it handles the requests and BeautifulSoup work for you:
import pandas as pd
URI = 'https://www.nsw.gov.au/covid-19/latest-news-and-updates'
# get tables
tables = pd.read_html(URI)
t1 = tables[0]
t2 = tables[1].dropna(axis=0)
# append tables
t = t1.append(t2, ignore_index=True)
# send tables to csv file
t.to_csv('my_table.csv', index=False, encoding='utf-8')
You may need to install lxml and html5lib, as Pandas' .read_html requires these dependencies.
Result (screenshot of the resulting table omitted).
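A version note, added to this answer: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same merge is written with pd.concat:

import pandas as pd

URI = 'https://www.nsw.gov.au/covid-19/latest-news-and-updates'

tables = pd.read_html(URI)
t1 = tables[0]
t2 = tables[1].dropna(axis=0)

# pd.concat replaces the removed DataFrame.append
t = pd.concat([t1, t2], ignore_index=True)
t.to_csv('my_table.csv', index=False, encoding='utf-8')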