bot-marvin
Highly scalable crawler with best features.
Last updated 4 years ago by tilak.

Features:

  • Asynchronous crawling
  • Distributed breadth-first crawls
  • Scales horizontally as well as vertically
  • URL partitioning for better scheduling
  • Scheduling by fetch interval and priority
  • Supports robots.txt and sitemap.xml parsing
  • Uses Apache Tika for file parsing
  • Web app for viewing crawled data and analytics
  • Fault tolerant, with automatic recovery on failures
  • Wide support for meta tags and HTTP status codes
  • Supports the tags advised by Google's crawl guidelines
  • Builds a web graph
  • Collects RSS feeds and author info
  • Pluggable parsers
  • Pluggable indexers (currently MongoDB is supported)
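
The "pluggable parsers" above are files dropped into the parsers dir (the seed option `parseFile` below selects one by name). The exact interface bot-marvin expects is not documented on this page, so the `parse` function name and its signature here are assumptions; this is only a minimal sketch of what such a module might look like.

```javascript
// Hypothetical parser plugin sketch. The (url, html) -> result shape is an
// assumption for illustration, not the documented bot-marvin contract.
function parse(url, html) {
  // Pull the page title with a naive regex.
  const titleMatch = html.match(/<title>([^<]*)<\/title>/i);

  // Collect href targets; in a real crawler these outlinks would feed
  // the frontier for the next breadth-first round.
  const outlinks = [];
  const linkRe = /href="([^"]+)"/g;
  let m;
  while ((m = linkRe.exec(html)) !== null) {
    outlinks.push(m[1]);
  }

  return {
    url,
    title: titleMatch ? titleMatch[1] : null,
    outlinks,
  };
}

module.exports = { parse };
```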

Install

sudo npm install bot-marvin

Starting your first crawl

    // You need to create a seed.json file first.
    // It looks like this:
    [
         {
            "_id": "http://www.imdb.com",
            "parseFile": "nutch",
            "priority": 1,
            "fetch_interval": "monthly",
            "limit_depth": -1
         },
         {
            "_id": "http://www.elastic.co",
            "parseFile": "nutch",
            "priority": 1,
            "fetch_interval": "monthly",
            "limit_depth": -1
         },
         {
            "_id": "http://www.rottentomatoes.com",
            "parseFile": "nutch",
            "priority": 1,
            "fetch_interval": "monthly",
            "limit_depth": 10
         }
    ]

    /*
    _id            : the URL to crawl
    parseFile      : the file name present in the parsers dir (default: 'nutch')
    priority       : 1-100; the percentage of a single crawl job's URLs taken from this domain.
                     Number of URLs of a domain in a batch = (priority / 100) * batch_size
    fetch_interval : the recrawl interval; supported values are always|weekly|monthly|yearly,
                     and you can add custom time intervals in the config
    limit_depth    : restricts crawling by depth; -1 means no depth limit
    */
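The priority rule above can be sketched directly. The formula is quoted from the comments; the `batchSize` value and the helper name `urlsPerBatch` are illustrative, not bot-marvin defaults.

```javascript
// Sketch of the scheduling rule:
// URLs of a domain per batch = (priority / 100) * batch_size
function urlsPerBatch(priority, batchSize) {
  // priority is 1-100, interpreted as a percentage of the batch
  return Math.floor((priority / 100) * batchSize);
}

// Example: with a batch of 200 URLs, a priority-25 domain
// contributes at most 50 URLs to a single crawl job.
console.log(urlsPerBatch(25, 200)); // → 50
```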
    
# Step 1 Set your db configuration
sudo bot-marvin-db
# Step 2 Set your bot config
sudo bot-marvin --config 
# Step 3 Load your seed file
sudo bot-marvin --loadSeedFile <path_to_your_seed_file> 
# Step 4 Run your crawler
sudo bot-marvin

Contributing

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Submit a pull request :D

Documentation is available at http://tilakpatidar.github.io/bot-marvin
