result is empty for any url from domain http://www.mckinsey.com #501
Comments
This website might have anti-scraping protection, which is usually triggered by headless browsers. Try setting
Thanks for your answer. It worked, but not as expected. See details below. First: the original problem
I have managed it in this way:
Source and credit: https://colab.research.google.com/drive/1or8DtXZP8ZxJYK52me0dA6O9A1dXKKOE?usp=sharing
That worked.
Second: 'result' contains several "\n" sequences and other stray characters. I got the following:
According to the documentation, the return value should be clean, plain text. What am I doing wrong or missing here?
I'm glad the first solution worked, at least partially.
The second part of the problem is harder to tackle. I confirm that the output should be plain text. However, the answer is generated by an LLM, with all the problems that may stem from that, and our library uses the same system prompt for all LLMs. It took weeks of tinkering with the prompts just to reduce the number of invalid JSON responses, and the output still looks weird sometimes. Unless we come up with separate prompts for each model, or with a custom LLM fine-tuned for scraping, this kind of unexpected behavior will keep showing up from time to time.
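Until the prompts improve, one possible workaround is to clean the returned value on the caller's side. The sketch below is not part of the library: `clean_text` and `clean_result` are hypothetical helper names, and it assumes the result is a string or a nested dict/list of strings, as JSON-style LLM output usually is.

```python
import re
from typing import Any


def clean_text(value: str) -> str:
    """Normalize whitespace in a single string."""
    # Turn literal backslash-n / backslash-t sequences into real whitespace.
    value = value.replace("\\n", "\n").replace("\\t", " ")
    # Collapse every run of whitespace (including real newlines) into one space.
    return re.sub(r"\s+", " ", value).strip()


def clean_result(result: Any) -> Any:
    """Recursively clean every string inside a dict/list structure."""
    if isinstance(result, str):
        return clean_text(result)
    if isinstance(result, dict):
        return {key: clean_result(item) for key, item in result.items()}
    if isinstance(result, list):
        return [clean_result(item) for item in result]
    # Numbers, booleans and None are returned unchanged.
    return result


if __name__ == "__main__":
    # Made-up sample data, only to show the call.
    raw = {
        "summary": "Autonomous vehicles\\n\\nare moving forward.\n",
        "topics": ["  mobility ", "safety\n"],
    }
    print(clean_result(raw))
```

Note that replacing literal `\n` sequences would also alter text where a backslash followed by `n` is intentional (for example a Windows path), so this is only suitable for prose-like output such as summaries.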
Describe the bug
All URLs from this domain return an empty result.
To Reproduce
Domain: http://www.mckinsey.com
URLs tested and not working:
https://www.mckinsey.com/features/mckinsey-center-for-future-mobility/our-insights/autonomous-vehicles-moving-forward-perspectives-from-industry-leaders
https://www.mckinsey.com/industries/automotive-and-assembly/our-insights/autonomous-drivings-future-convenient-and-connected
Prompt: Summarize and find the main topics
My code:
Steps to reproduce the behavior:
I got this from McKinsey URLs
Expected behavior