Python requests POST error, session issue?

Aug 25 2020

I am trying to mimic the following browser actions through Python requests:

Land on https://www.bundesanzeiger.de/pub/en/to_nlp_start
Click "More search options"
Check the "Also find historicised data" checkbox (corresponds to the POST parameter isHistorical: true)
Click the "Search net short positions" button
Click the "Als CSV herunterladen" button to download the CSV file

This is the code I have to simulate this:

import requests
import re

s = requests.Session()
r = s.get("https://www.bundesanzeiger.de/pub/en/to_nlp_start", verify=False, allow_redirects=True)

matches = re.search(
        r'form class="search-form" id=".*" method="post" action="\.(?P<appendtxt>.*)"',
        r.text
    )
request_url = f"https://www.bundesanzeiger.de/pub/en{matches.group('appendtxt')}"
sr = s.post(request_url, data={'isHistorical': 'true', 'nlp-search-button': 'Search net short positions'}, allow_redirects=True)

However, even though sr gives me a status_code of 200, it is actually an error: when I check sr.url, it shows https://www.bundesanzeiger.de/pub/en/error-404?9
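Because allow_redirects=True makes requests follow redirects, status_code describes the final page rather than the POST itself. A minimal sketch of a check for this case is below. The helper name is hypothetical, and the "error-" prefix is an assumption based only on the single URL shown above.

```python
from urllib.parse import urlparse


def is_error_page(final_url: str) -> bool:
    """Return True if the last path segment looks like an error page.

    Assumption: error pages use a last path segment starting with "error-",
    as in the one URL quoted in the question.
    """
    last_segment = urlparse(final_url).path.rsplit("/", 1)[-1]
    return last_segment.startswith("error-")


print(is_error_page("https://www.bundesanzeiger.de/pub/en/error-404?9"))
print(is_error_page("https://www.bundesanzeiger.de/pub/en/to_nlp_start"))
```

With a live response, sr.history lists the intermediate redirect responses, so a non-empty sr.history together with is_error_page(sr.url) would flag this kind of failure.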

Digging a bit deeper, I noticed that the request_url above resolves to something like

https://www.bundesanzeiger.de/pub/en/nlp;wwwsid=EFEB15CD4ADC8932A91BA88B561A50E9.web07-pub?0-1.-nlp~filter~form~panel-form

but when I check the request URL in Chrome, it is actually

https://www.bundesanzeiger.de/pub/en/nlp?87-1.-nlp~filter~form~panel-form

The 87 here seems to change, suggesting that it is a session ID, but when I do this with requests it does not seem to resolve properly.
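To make the difference between the two URLs easier to see, here is a small sketch that splits them with the standard library. The describe helper is a hypothetical name. It only shows that the URL built from the requests response carries an extra ;wwwsid=... path parameter and a different leading number in the query string; it does not confirm what either value means.

```python
from urllib.parse import urlparse


def describe(url: str) -> dict:
    """Split a URL into its path, path parameters, and query string."""
    parts = urlparse(url)
    return {"path": parts.path, "params": parts.params, "query": parts.query}


# Both URLs are copied from the question.
from_requests = (
    "https://www.bundesanzeiger.de/pub/en/nlp"
    ";wwwsid=EFEB15CD4ADC8932A91BA88B561A50E9.web07-pub"
    "?0-1.-nlp~filter~form~panel-form"
)
from_chrome = "https://www.bundesanzeiger.de/pub/en/nlp?87-1.-nlp~filter~form~panel-form"

for label, url in (("requests", from_requests), ("chrome", from_chrome)):
    print(label, describe(url))
```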

Any idea what I am missing here?

Answers

1 AndrejKesely Aug 25 2020 at 02:28

You can try this script to download the CSV file:

import requests
from bs4 import BeautifulSoup


url = 'https://www.bundesanzeiger.de/pub/en/to_nlp_start'

data = {
    'fulltext': '',
    'positionsinhaber': '',
    'ermittent': '',
    'isin': '',
    'positionVon': '',
    'positionBis': '',
    'datumVon': '',
    'datumBis': '',
    'isHistorical': 'true',
    'nlp-search-button': 'Search+net+short+positions'
}

headers = {
    'Referer': 'https://www.bundesanzeiger.de/'
}

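with requests.session() as s:
    # Load the search page within the session
    soup = BeautifulSoup(s.get(url).content, 'html.parser')

    # The form action varies between page loads, so read it from the page;
    # the "t and" guard skips forms that have no action attribute
    action = soup.find('form', action=lambda t: t and 'nlp~filter~form~panel-for' in t)['action']
    u = 'https://www.bundesanzeiger.de/pub/en' + action.strip('.')

    # Submit the search form with the historical-data flag set
    soup = BeautifulSoup(s.post(u, data=data, headers=headers).content, 'html.parser')

    # Build the absolute URL of the "Download as CSV" link on the results page
    a = soup.select_one('a[title="Download as CSV"]')['href']
    a = 'https://www.bundesanzeiger.de/pub/en' + a.strip('.')

    # utf-8-sig drops a byte order mark at the start of the file, if present
    print(s.get(a, headers=headers).content.decode('utf-8-sig'))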

Prints:

"Positionsinhaber","Emittent","ISIN","Position","Datum"
"Citadel Advisors LLC","LEONI AG","DE0005408884","0,62","2020-08-21"
"AQR Capital Management, LLC","Evotec SE","DE0005664809","1,10","2020-08-21"
"BlackRock Investment Management (UK) Limited","thyssenkrupp AG","DE0007500001","1,50","2020-08-21"
"BlackRock Investment Management (UK) Limited","Deutsche Lufthansa Aktiengesellschaft","DE0008232125","0,75","2020-08-21"
"Citadel Europe LLP","TAG Immobilien AG","DE0008303504","0,70","2020-08-21"
"Davidson Kempner European Partners, LLP","TAG Immobilien AG","DE0008303504","0,36","2020-08-21"
"Maplelane Capital, LLC","VARTA AKTIENGESELLSCHAFT","DE000A0TGJ55","1,15","2020-08-21"


...and so on.
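If you want rows rather than raw text, the output above can be parsed with the csv module. This is only a sketch: parse_positions is a hypothetical helper, the sample is the first lines of the output shown above, and it assumes the Position column always uses a decimal comma as it does there.

```python
import csv
import io


def parse_positions(csv_text: str) -> list:
    """Parse the downloaded CSV text into a list of dicts, one per row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: "Position" uses a decimal comma, e.g. "0,62"
        row["Position"] = float(row["Position"].replace(",", "."))
        rows.append(row)
    return rows


sample = '''"Positionsinhaber","Emittent","ISIN","Position","Datum"
"Citadel Advisors LLC","LEONI AG","DE0005408884","0,62","2020-08-21"
"AQR Capital Management, LLC","Evotec SE","DE0005664809","1,10","2020-08-21"
'''

for row in parse_positions(sample):
    print(row["Positionsinhaber"], row["Emittent"], row["Position"])
```

The quoting matters here: "AQR Capital Management, LLC" contains a comma, so splitting lines on "," by hand would break that row, while csv.DictReader handles it.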

idkhowtocode Aug 24 2020 at 23:56

If you look at https://www.bundesanzeiger.de/robots.txt, this website does not want to be indexed. The site may deny access to the default user agent used by bots. This might help: Python requests vs. robots.txt
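As a rough sketch of how to check a robots.txt file from Python, the standard library has urllib.robotparser. The rules and URLs below are made up for illustration and are not the actual contents of the site's robots.txt; is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, NOT copied from the real site
sample_rules = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(sample_rules, "my-script", "https://example.com/private/page"))
print(is_allowed(sample_rules, "my-script", "https://example.com/public/page"))
```

To check the live file you would download the text of robots.txt first and pass it in. If the default user agent is the problem, requests lets you send a different one through the headers argument, e.g. headers={'User-Agent': '...'}.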