getsitemap
Node.js module that recursively crawls a website's sitemap and returns a stream of URLs
Last updated 5 months ago by sunnypurewal.
License: ISC
$ cnpm install getsitemap 

getsitemap

getsitemap is a library that takes a domain name as input and returns a stream of <url> objects from the <urlset> elements of the website's sitemap.xml file(s). It can be used for obtaining a list of pages to crawl from a website. The objects in the stream will match the sitemap protocol:

{
  url: "http://newyorktimes.com", // Always present
  lastmod: "2019-10-01" // Optional
}
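Because lastmod is optional, code that consumes these objects usually has to decide what to do with undated entries. The following is a minimal sketch, assuming entries have the shape shown above; `isModifiedSince` and its `keepUndated` parameter are hypothetical names for illustration and are not part of getsitemap.

```javascript
// Hypothetical helper, not part of getsitemap's API.
// Returns true when the entry's lastmod is on or after `since` (ms since epoch).
// Entries with a missing or unparseable lastmod are kept by default,
// because their age is unknown.
function isModifiedSince(entry, since, keepUndated = true) {
  if (!entry.lastmod) return keepUndated
  const modified = Date.parse(entry.lastmod)
  if (Number.isNaN(modified)) return keepUndated
  return modified >= since
}

const cutoff = Date.parse("2019-10-01")
const entries = [
  { url: "http://example.com/a", lastmod: "2019-10-05" },
  { url: "http://example.com/b", lastmod: "2019-09-01" },
  { url: "http://example.com/c" }, // no lastmod
]
const recent = entries.filter((entry) => isModifiedSince(entry, cutoff))
console.log(recent.map((entry) => entry.url))
```

Passing `false` as the third argument drops undated entries instead of keeping them.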

See Turbo Crawl for a powerful web crawling library based on getsitemap.

Usage

Stream the URL set to a file. The output is ndjson: each line is a standalone JSON object. The file as a whole is not valid JSON, but the format is convenient for reading large files line by line.

const fs = require("fs")
const getsitemap = require("getsitemap")

const url = "theintercept.com"
const since = Date.parse("2019-10-01")

const mapper = new getsitemap.SiteMapper(url)
const sitemapstream = mapper.map(since)
const file = fs.createWriteStream(`./intercept.ndjson`)
sitemapstream.pipe(file)

/* OR, instead of piping to a file, handle each object as it arrives */
sitemapstream.on("data", (obj) => {
  // obj.url, obj.lastmod
})
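Reading the resulting ndjson file back line by line might look like the sketch below. It uses only Node's built-in `fs` and `readline` modules; `readNdjson` is a hypothetical helper name, and the file path in the usage comment is the one from the example above.

```javascript
const fs = require("fs")
const readline = require("readline")

// Hypothetical helper: reads an ndjson file one line at a time and calls
// onEntry with each parsed object. Blank lines are skipped.
// Resolves with the number of entries read.
async function readNdjson(path, onEntry) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity, // treat \r\n as a single line break
  })
  let count = 0
  for await (const line of rl) {
    if (line.trim() === "") continue
    onEntry(JSON.parse(line))
    count++
  }
  return count
}

// Usage:
// readNdjson("./intercept.ndjson", (entry) => console.log(entry.url, entry.lastmod))
```

Only one line is parsed at a time, so memory use stays roughly constant regardless of file size.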

Configuration

getsitemap uses hittp under the hood to make HTTP requests. By default it waits 3 seconds between requests to the same host so as not to overload the server. getsitemap can be configured in the same way as hittp:

const fs = require("fs")
const getsitemap = require("getsitemap")

const url = "theintercept.com"
const since = Date.parse("2019-10-01")
const options = { delay_ms: 3000, cachePath: "./.hittp/cache" } // Default

const mapper = new getsitemap.SiteMapper()
mapper.map(url, since, options).then((sitemapstream) => {
  const file = fs.createWriteStream(`./intercept.ndjson`)
  sitemapstream.pipe(file)
})

Don't forget to add your cache path to .gitignore! Default path is ./.hittp
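To illustrate what a per-host delay such as delay_ms means in practice, here is a small sketch of the idea. It is not hittp's implementation; `HostThrottle` and `reserve` are hypothetical names, and the explicit `now` argument exists only to make the behavior easy to follow.

```javascript
// Illustrative only: NOT how hittp is implemented.
class HostThrottle {
  constructor(delayMs = 3000) {
    this.delayMs = delayMs
    this.nextAllowed = new Map() // host -> earliest time (ms) for the next request
  }

  // Returns how many ms the caller should wait before requesting `url`,
  // and reserves that slot so later calls for the same host queue behind it.
  reserve(url, now = Date.now()) {
    const host = new URL(url).host
    const earliest = Math.max(now, this.nextAllowed.get(host) || 0)
    this.nextAllowed.set(host, earliest + this.delayMs)
    return earliest - now
  }
}

const throttle = new HostThrottle(3000)
const waits = [
  throttle.reserve("https://example.com/sitemap.xml", 0),
  throttle.reserve("https://example.com/sitemap-news.xml", 0),
  throttle.reserve("https://example.org/sitemap.xml", 0),
]
console.log(waits)
```

Requests to different hosts do not wait on each other. Only repeated requests to the same host are spaced out.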

Current Tags

  • 0.11.0                                ...           latest (5 months ago)

38 Versions

  • 0.11.0                                ...           5 months ago
  • 0.10.0                                ...           6 months ago
  • 0.9.2                                ...           7 months ago
  • 0.9.1 [deprecated]           ...           7 months ago
  • 0.9.0 [deprecated]           ...           7 months ago
  • 0.8.6 [deprecated]           ...           7 months ago
  • 0.8.5 [deprecated]           ...           7 months ago
  • 0.8.4 [deprecated]           ...           7 months ago
  • 0.8.3 [deprecated]           ...           7 months ago
  • 0.8.2 [deprecated]           ...           7 months ago
  • 0.8.1 [deprecated]           ...           7 months ago
  • 0.8.0 [deprecated]           ...           7 months ago
  • 0.7.14 [deprecated]           ...           7 months ago
  • 0.7.13 [deprecated]           ...           7 months ago
  • 0.7.12 [deprecated]           ...           7 months ago
  • 0.7.11 [deprecated]           ...           7 months ago
  • 0.7.10 [deprecated]           ...           7 months ago
  • 0.7.9 [deprecated]           ...           7 months ago
  • 0.7.8 [deprecated]           ...           7 months ago
  • 0.7.7 [deprecated]           ...           7 months ago
  • 0.7.6 [deprecated]           ...           7 months ago
  • 0.7.5 [deprecated]           ...           7 months ago
  • 0.7.4 [deprecated]           ...           7 months ago
  • 0.7.3 [deprecated]           ...           7 months ago
  • 0.7.2 [deprecated]           ...           7 months ago
  • 0.7.1 [deprecated]           ...           7 months ago
  • 0.7.0 [deprecated]           ...           7 months ago
  • 0.6.2 [deprecated]           ...           7 months ago
  • 0.6.1 [deprecated]           ...           7 months ago
  • 0.6.0 [deprecated]           ...           7 months ago
  • 0.5.0 [deprecated]           ...           8 months ago
  • 0.4.0 [deprecated]           ...           8 months ago
  • 0.3.1 [deprecated]           ...           8 months ago
  • 0.3.0 [deprecated]           ...           8 months ago
  • 0.2.1 [deprecated]           ...           8 months ago
  • 0.2.0 [deprecated]           ...           8 months ago
  • 0.1.1 [deprecated]           ...           8 months ago
  • 0.1.0 [deprecated]           ...           8 months ago