python requests POST error, session issue?
I am trying to replicate the following browser actions via python requests:
- Land on https://www.bundesanzeiger.de/pub/en/to_nlp_start
- Click "More search options"
- Tick the "Also find historicised data" checkbox (corresponds to the POST param isHistorical: true)
- Click the "Search net short positions" button
- Click the "Als CSV herunterladen" button to download the csv file
This is the code I have to simulate this:
import requests
import re

s = requests.Session()
r = s.get("https://www.bundesanzeiger.de/pub/en/to_nlp_start", verify=False, allow_redirects=True)

matches = re.search(
    r'form class="search-form" id=".*" method="post" action="\.(?P<appendtxt>.*)"',
    r.text
)
request_url = f"https://www.bundesanzeiger.de/pub/en{matches.group('appendtxt')}"

sr = s.post(
    request_url,
    data={'isHistorical': 'true', 'nlp-search-button': 'Search net short positions'},
    allow_redirects=True
)
However, even though sr gives me a status_code of 200, it is actually an error: when I inspect sr.url, it shows https://www.bundesanzeiger.de/pub/en/error-404?9
Digging a bit deeper, I noticed that request_url above resolves to something like
https://www.bundesanzeiger.de/pub/en/nlp;wwwsid=EFEB15CD4ADC8932A91BA88B561A50E9.web07-pub?0-1.-nlp~filter~form~panel-form
but when I inspect the request URL in Chrome, it is actually
https://www.bundesanzeiger.de/pub/en/nlp?87-1.-nlp~filter~form~panel-form
The 87 here seems to change between visits, suggesting it is some session ID, but when I do this with requests it does not seem to resolve correctly.
Any idea what I am missing here?
Answer
You can try this script to download the CSV file:
import requests
from bs4 import BeautifulSoup

url = 'https://www.bundesanzeiger.de/pub/en/to_nlp_start'

data = {
    'fulltext': '',
    'positionsinhaber': '',
    'ermittent': '',
    'isin': '',
    'positionVon': '',
    'positionBis': '',
    'datumVon': '',
    'datumBis': '',
    'isHistorical': 'true',
    'nlp-search-button': 'Search+net+short+positions'
}

headers = {
    'Referer': 'https://www.bundesanzeiger.de/'
}

with requests.session() as s:
    # load the landing page and extract the search form's action URL
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    action = soup.find('form', action=lambda t: t and 'nlp~filter~form~panel-form' in t)['action']
    u = 'https://www.bundesanzeiger.de/pub/en' + action.strip('.')

    # submit the search and locate the CSV download link
    soup = BeautifulSoup(s.post(u, data=data, headers=headers).content, 'html.parser')
    a = soup.select_one('a[title="Download as CSV"]')['href']
    a = 'https://www.bundesanzeiger.de/pub/en' + a.strip('.')

    # fetch the CSV; utf-8-sig strips the byte-order mark
    print(s.get(a, headers=headers).content.decode('utf-8-sig'))
This prints:
"Positionsinhaber","Emittent","ISIN","Position","Datum"
"Citadel Advisors LLC","LEONI AG","DE0005408884","0,62","2020-08-21"
"AQR Capital Management, LLC","Evotec SE","DE0005664809","1,10","2020-08-21"
"BlackRock Investment Management (UK) Limited","thyssenkrupp AG","DE0007500001","1,50","2020-08-21"
"BlackRock Investment Management (UK) Limited","Deutsche Lufthansa Aktiengesellschaft","DE0008232125","0,75","2020-08-21"
"Citadel Europe LLP","TAG Immobilien AG","DE0008303504","0,70","2020-08-21"
"Davidson Kempner European Partners, LLP","TAG Immobilien AG","DE0008303504","0,36","2020-08-21"
"Maplelane Capital, LLC","VARTA AKTIENGESELLSCHAFT","DE000A0TGJ55","1,15","2020-08-21"
...and so on.
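Note that the CSV uses German number formatting ("0,62" for 0.62). If you want the data in a DataFrame rather than raw text, here is a minimal sketch with pandas; the csv_text sample below is copied from the output above, and in practice you would pass the decoded response text instead:

```python
import io
import pandas as pd

# A sample of the CSV returned by the site (German decimal commas).
csv_text = '''"Positionsinhaber","Emittent","ISIN","Position","Datum"
"Citadel Advisors LLC","LEONI AG","DE0005408884","0,62","2020-08-21"
"AQR Capital Management, LLC","Evotec SE","DE0005664809","1,10","2020-08-21"
'''

# decimal=',' converts "0,62" into the float 0.62;
# parse_dates turns the "Datum" column into timestamps.
df = pd.read_csv(io.StringIO(csv_text), decimal=',', parse_dates=['Datum'])
print(df['Position'].max())  # 1.1
```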
If you check https://www.bundesanzeiger.de/robots.txt, this website does not want to be indexed, and it may deny access to the default user agent used by bots. This might help: Python requests vs robots.txt
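One way to avoid being rejected as a bot is to present a browser-like User-Agent on the whole session instead of the default python-requests string. A minimal sketch; the exact agent string below is illustrative, not something the site documents:

```python
import requests

headers = {
    # Illustrative browser-like agent string; any current browser UA works.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'https://www.bundesanzeiger.de/',
}

s = requests.Session()
s.headers.update(headers)  # sent with every request made through this session
# r = s.get('https://www.bundesanzeiger.de/pub/en/to_nlp_start')
```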