Questions tagged [web-crawler]
A Web crawler (also known as a Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web robots, or, especially in the FOAF community, Web scutters.
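As a rough illustration of "methodical, automated" browsing, here is a minimal sketch of a breadth-first crawl. It makes no network requests: the `LINKS` dictionary, the example.com URLs, and the `crawl` function are all hypothetical stand-ins for fetching a page and extracting its links.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first starting at seed, each URL at most once."""
    seen = {seed}           # URLs already queued, so cycles do not loop forever
    queue = deque([seed])   # frontier of URLs still to visit
    order = []              # URLs in the order they were visited
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("https://example.com/", lambda url: LINKS.get(url, []))
print(visited)
```

A real crawler would replace the lambda with an HTTP fetch plus link extraction, and would normally also respect robots.txt and rate limits.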
web-crawler
9,707 questions
0 votes, 0 answers, 15 views
Unable to Stop Running Sync Job in AWS Bedrock Knowledge Base
I have an issue with an AWS Bedrock Knowledge Base that uses a Web crawler as its data source. I accidentally added two Wikipedia URLs (e.g. "https://en.wikipedia.org/wiki/article1" and a second URL: "https:...
-4 votes, 1 answer, 28 views
Crawl data from the IMDb Top 250 Movies
Please, I need someone to help me. I can't understand why I only crawl 25 movies instead of 250. My code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': '...
-1 votes, 0 answers, 8 views
Does Webflow pagination hurt SEO? [closed]
I'm using Webflow for a certain website, and a lot of paginated pages end up in the GSC tab "Crawled - currently not indexed".
For example: https://www.example.com/blog?65b097f7_page=5
Is this hurting ...
0 votes, 1 answer, 29 views
How to exclude div classes 'modal-content' and 'modal-body' from pyppeteer web scraper?
I'm building a scraper that gets text data from a list of articles. A recurring element in the text content I'm currently scraping is this message at the bottom:
"As a subscriber, ...
0 votes, 0 answers, 11 views
Sudden increase in requests received
My application suddenly had a huge increase in the number of requests being made to it. I believe the only change of merit was adding a sitemap.xml, and I believe the increase in requests is due to ...
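One way to gauge how much a sitemap could contribute to crawler traffic is to count the URLs it advertises. The sketch below is only an illustration: `SAMPLE_SITEMAP`, its example.com URLs, and the `sitemap_urls` helper are hypothetical, and only the namespace comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real one would be read from sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/blog</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


urls = sitemap_urls(SAMPLE_SITEMAP)
print(len(urls), "URLs advertised to crawlers")
```

Comparing this list against the paths that appear in the server logs can help confirm, or rule out, the sitemap as the source of the extra requests.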
0 votes, 0 answers, 7 views
GitHub Action: problem overwriting/replacing/updating a .json file
I want to use a Google web API to scrape some coffee shop info from my country.
There is already an original version of the .json file in my repo, but if a new coffee shop is created, I need to ...
0 votes, 0 answers, 19 views
AWS crawler creating Null values for partition columns
I have some country-level partitioned data in S3, and a crawler is crawling this root folder and creating a table. There are no null values for the country code. But when I look in Athena, there ...
-3 votes, 0 answers, 37 views
Download ICD-10 codes (International Classification of Diseases)
We can easily browse the ICD-10 codes: https://icd.who.int/browse10/2019/en
Unfortunately, there is no way to download all of the codes as a TXT (or XLS) file in order to parse them with Python, or import ...
-1 votes, 0 answers, 20 views
Crawler - Rotten Tomatoes website - problem with pages
I'm trying to crawl the Rotten Tomatoes website, but I have a problem:
to get the HTML for page 5 and above of the movies, for example:
https://www.rottentomatoes.com/browse/movies_at_home/?page=8
...
1 vote, 1 answer, 62 views
Scrapy Spider does not work with multiple URLs
I wrote a Scrapy spider and used Selenium in it to scrape the products on devgrossonline.com.
It does not work with multiple category URLs, but it works when I provide only one URL.
Here is my spider:
...
-1 votes, 0 answers, 22 views
The time obtained by the Python crawler is incorrect when getting comments
When I use Python to crawl stock comments from a website, the time parsed from the website is different from the time obtained by my crawler.
For example:
When I use F12 to inspect the website, I find ...
0 votes, 1 answer, 37 views
TYPO3 indexed search fails to index PDF files
I'm hoping to get help with a problem I can't solve. The working environment is as follows:
SYSTEM
Debian 12 bookworm
PHP 7.4 (tried 8.2 and 8.3 with failure on crawler) + FPM/FastCGI
/usr/bin/...
0 votes, 0 answers, 13 views
How to download PDFs using Norconex Web Crawler?
I have tried to download PDFs from certain URLs (e.g. https://example.com) using the Norconex Web Crawler (v3.0) and the configuration below but no luck. Can someone please help me with this?
<?xml ...
0 votes, 0 answers, 40 views
Getting subsequent GET calls for some PUT, POST APIs in web site
I'm observing subsequent GET calls for some PUT and POST APIs. I already checked the code, and there are no GET calls created for those endpoints. But I'm seeing these calls in my server logs.
Say for example ...
-2 votes, 0 answers, 40 views
TikTok finding username with videoID
I am currently working on a project that deals with the data of the DSA Transparency Database. Specifically, I am looking at the TikTok data. Now I would like to go one step further and check if the ...