The Websites Crawler lets you crawl and extract content from multiple pages of a website by following internal links. You submit one or more starting URLs and define how deep the crawler should go using maxDepth. You can also restrict which paths are followed by passing a regex to includePaths.

This scraper job is asynchronous. You’ll receive a jobId, and results can be fetched via polling or delivered to a webhook.
Example Request
curl --request POST \
  --url 'https://api.hasdata.com/scrapers/crawler/jobs' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <your-api-key>' \
  --data '{
    "urls": [
      "https://example.com"
    ],
    "maxDepth": 3,
    "includePaths": "(blog/.+|articles/.+)",
    "outputFormat": ["text", "json"],
    "webhook": {
      "url": "https://yourdomain.com/webhook",
      "events": ["scraper.job.started", "scraper.job.finished", "scraper.data.scraped"]
    }
  }'
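The response to this request returns the job identifier used everywhere below. A minimal sketch of the response body; the exact shape is an assumption, and the field name jobId is inferred from the prose above and the webhook payload:

{
  "jobId": "dd1a8c53-2d47-4444-977d-8d653a6a3c82"
}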
Use Web Scraping API Params
You can use any parameters from the Web Scraping API inside a Websites Crawler job, including extractRules, aiExtractRules, headers, proxyType / proxyCountry, blockResources, jsScenario, outputFormat, and more. All parameters are applied to each crawled page individually.
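For example, here is a sketch of a crawler job body that extracts the same fields from every crawled page. The extractRules value shown (a name-to-rule mapping with selector and output keys) is an assumption for illustration; the exact schema is defined in the Web Scraping API reference:

{
  "urls": ["https://example.com"],
  "maxDepth": 2,
  "includePaths": "blog/.+",
  "extractRules": {
    "title": { "selector": "h1", "output": "text" }
  },
  "outputFormat": ["json"]
}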
Get Scraper Job Status
To get the status of an existing scraper job, make a GET request to the endpoint /scrapers/jobs/:jobId:
curl --location 'https://api.hasdata.com/scrapers/jobs/:jobId' \
  --header 'x-api-key: <your-api-key>'
{
  "id": "dd1a8c53-2d47-4444-977d-8d653a6a3c82",
  "status": "finished",
  "creditsSpent": 200,
  "dataRowsCount": 20,
  "data": {
    "csv": "https://api.hasdata.com/scrapers/jobs/dd1a8c53-2d47-4444-977d-8d653a6a3c82/results/b6cc6733-6d0e-4e44-9e94-38688aad3884.csv",
    "json": "https://api.hasdata.com/scrapers/jobs/dd1a8c53-2d47-4444-977d-8d653a6a3c82/results/9cb592e3-6700-42ff-b58c-e7da3f478f28.json",
    "xlsx": "https://api.hasdata.com/scrapers/jobs/dd1a8c53-2d47-4444-977d-8d653a6a3c82/results/ecea853c-e0ca-4a23-ae74-eea0588e54b6.xlsx"
  },
  "input": {
    "limit": 25,
    "urls": ["https://hasdata.com", "https://example.com"],
    "maxDepth": 5,
    "includePaths": "(blog/.+|articles/.+)",
    "webhook": {
      "url": "https://example.com/webhook",
      "events": ["scraper.job.started", "scraper.job.finished", "scraper.data.scraped"]
    }
  }
}
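If you did not configure a webhook, poll this endpoint until the job completes. A minimal polling sketch in Python; the in_progress and finished statuses come from the examples on this page, and treating any other status as terminal is an assumption:

import time

import requests

API_KEY = "<your-api-key>"

def wait_for_job(job_id: str, interval: float = 10.0) -> dict:
    # Poll the job status endpoint until the job leaves "in_progress".
    url = f"https://api.hasdata.com/scrapers/jobs/{job_id}"
    while True:
        resp = requests.get(url, headers={"x-api-key": API_KEY})
        resp.raise_for_status()
        job = resp.json()
        if job["status"] != "in_progress":
            return job
        time.sleep(interval)

job = wait_for_job("dd1a8c53-2d47-4444-977d-8d653a6a3c82")
if job["status"] == "finished":
    # "data" holds download links for each requested output format.
    print(job["data"]["json"])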
Webhook
The webhook will notify you of events related to the scraper job. Here is an example webhook payload for the scraper.data.scraped event:
{
  "event": "scraper.data.scraped",
  "timestamp": "2025-04-11T14:30:00Z",
  "jobId": "dd1a8c53-2d47-4444-977d-8d653a6a3c82",
  "jobStatus": "in_progress",
  "data": [
    {
      "text": "Extracted text here...",
      "statusCode": 200,
      "statusText": "OK",
      "url": "https://hasdata.com/blog",
      "depth": 1,
      "title": "Blog | HasData"
    }
  ]
}
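On the receiving side, the webhook URL you registered simply needs to accept these JSON POSTs. A minimal receiver sketch using Flask; the scraper.data.scraped fields follow the example above, while the shape of the other events and any delivery signing or retry behavior are assumptions to verify against the docs:

from flask import Flask, request

app = Flask(__name__)

@app.post("/webhook")
def handle_webhook():
    payload = request.get_json()
    if payload["event"] == "scraper.data.scraped":
        # Pages arrive in batches while the job is still in progress.
        for page in payload["data"]:
            print(page["depth"], page["url"], page["title"])
    elif payload["event"] == "scraper.job.finished":
        # Final event: fetch the result files via the job status endpoint.
        print("Job", payload["jobId"], "finished")
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)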