Scraping an AJAX page: how do I obtain the values of the AJAX parameters? (400 request error)

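Hi everyone. I'm scraping http://hair.allwomenstalk.com/ and want to collect the images, the text shown on each image, and the URL of the article each image links to. The site uses a waterfall (infinite-scroll) layout, so an ordinary crawler can't get at the content; the only option I see is to simulate the AJAX requests that fetch whatever wasn't loaded on the first page load. New content is loaded when the scrollbar is dragged down, which triggers a load handler that calls JS to fetch it. The code is as follows: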

<script type="text/javascript">
	function downloadJSAtOnload() {
	if (awsL10n.isSingle) {
	var scriptImgSource = document.createElement("script");
	scriptImgSource.src = "http://gem.allw.mn/ext/get_img_source.js";
	document.body.appendChild(scriptImgSource);
	}
	}

	// Check for browser support of event handling capability
	if (window.addEventListener)
	window.addEventListener("load", downloadJSAtOnload, false);
	else if (window.attachEvent)
	window.attachEvent("onload", downloadJSAtOnload);
	else window.onload = downloadJSAtOnload;
	</script>

We can see that it loads the JS file http://gem.allw.mn/ext/get_img_source.js. In that file, the parameters are first passed in from jQuery's ready handler:

jQuery('document').ready(function($) {
  var top = {
    name: 'top',
    img: 'img.post-image.size-full'
  };

  var internal = {
    name: 'internal',
    img: '#content img.size-full:visible'
  };

  // first load
  getImageSource(top);
  getImageSource(internal);

  on('postPageSwitched', function(e, num, url) {
    if (num===1) getImageSource(top);
    getImageSource(internal);
  });
});

If it isn't the first load, it runs

getImageSource(internal);

and the code of getImageSource is as follows:

function getImageSource(params) {
  var src = $(params.img).attr('src');
  if ( !src ) return;

  var img = src.match(/^http[^?]*/);
  img && !$('a[data-src="'+img+'"]').length && $.ajax({
    type: 'GET',
    crossDomain: true,
    url: 'http://gem.allw.mn/ext/get_img_source.php?img_url='+encodeURIComponent(img),
    dataType: 'jsonp',
    success: function(json) {
      if (typeof json==='object' && json.Data && json.Data.source_url) {
        var label = $(document.createElement('a'));
          label.css({
            position: 'absolute',
            right: 5,
            zIndex: 1,
            padding: '0 6px',
            marginTop: -22,
            opacity: 0.4,
            color: 'white',
            background: 'rgba(0, 0, 0, .3)',
            border: '1px solid white',
            'border-radius': 3,
            'font-size': 7,
            'line-height': '2.2em',
            'text-transform': 'uppercase'
          }).hover(
            function(){ $(this).css({ opacity: 0.8 }); },
            function(){ $(this).css({ opacity: 0.4 }); }
          );
          label.attr('href', json.Data.source_url);
          label.attr('target', '_blank');
          label.attr('data-src', img);
          label.text(json.Data.source_name);

        // tracking
        if ( typeof _gaq === 'object' )
          label.click(function(){
            _gaq.push(['_trackEvent', 'PhotoCredit', params.name, this.href]);
          });
        else console.log('Tracking not set');

        $(params.img).after(label);
      } else {
        console.log(typeof json==='object' && json.Errors);
      }
    }
  });
}

What I can't follow is the first three lines of getImageSource. First, the parameter passed in is

var internal = {
    name: 'internal',
    img: '#content img.size-full:visible'
  };

and the first line

var src = $(params.img).attr('src');

reads the src attribute of params.img, which I would expect to be empty, so the second line would certainly return early:

if ( !src ) return;

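But the page really does load new content, which means src is not empty. So how does src get its value, and how can I find out what that value is? Thanks a lot for any answers!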
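For readers with the same question: params.img is only a CSS selector string. `$(params.img)` asks jQuery to find the matching `<img>` elements in the page, and `.attr('src')` returns the src of the first match, or `undefined` when nothing matches (the case the `if ( !src ) return;` guard handles). Below is a minimal sketch of what the following lines then do with that value, runnable in Node without a DOM. `buildSourceLookupUrl` and the sample image URL are hypothetical, introduced only for illustration.

```javascript
// Hypothetical helper: mirrors the regex and URL-building steps of
// getImageSource, taking the src string directly instead of querying the DOM.
function buildSourceLookupUrl(src) {
  if (!src) return null; // same guard as `if ( !src ) return;`

  // Keep everything up to (not including) the first "?", i.e. drop the query string.
  const match = src.match(/^http[^?]*/);
  if (!match) return null;

  return 'http://gem.allw.mn/ext/get_img_source.php?img_url=' +
    encodeURIComponent(match[0]);
}

// Made-up image URL, for illustration only.
console.log(buildSourceLookupUrl('http://example.com/photos/curls.jpg?w=600'));
```

To see the real value on the live page, one option is to type `$('#content img.size-full:visible').attr('src')` into the browser console, assuming the page exposes jQuery as `$` there.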

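Actually, there is no need to read the source. The Network panel in Chrome Developer Tools shows exactly which parameters have to be sent. I took a quick look: additional content is fetched via JSONP, which essentially means loading a JS file dynamically and executing its code to obtain the content returned by the server.
One of the requests looks like this: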
http://api2.allwomenstalk.com/posts/category/hair?callback=jQuery18305815418781712651_1389192501094&count=40&next_page=eyJmIjoxMzc1NDUxNTYyfQ%3D%3D&format=jsonp&_=1389192528184
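where:

callback is the name of the JS function to execute once the JS file has loaded
count is the number of items to return
next_page is a base64-encoded object; decoded, it looks something like {"f":1361921071} (the token in the URL above decodes to {"f":1375451562}), and the number is presumably a timestamp (a guess)
format is the response format; change jsonp to json and the callback parameter can be omitted, in which case a plain JSON object is returned
_ is a timestamp that stops the browser from caching the AJAX result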
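The next_page value from the request above can be checked with a few lines of Node.js. This is a minimal sketch, assuming the token is URL-encoded base64 wrapping a JSON object, as described; `decodeNextPage` is a hypothetical helper name.

```javascript
// Hypothetical helper: undo the URL-encoding, then the base64, then parse the JSON.
function decodeNextPage(token) {
  const base64 = decodeURIComponent(token); // "%3D%3D" becomes "=="
  const json = Buffer.from(base64, 'base64').toString('utf8');
  return JSON.parse(json);
}

// Token copied from the request URL above.
const cursor = decodeNextPage('eyJmIjoxMzc1NDUxNTYyfQ%3D%3D');
console.log(cursor); // { f: 1375451562 }

// If f really is a Unix timestamp in seconds (a guess), this prints the date it denotes.
console.log(new Date(cursor.f * 1000).toISOString());
```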

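The request address itself is http://api2.allwomenstalk.com/posts/category/hair. So to fetch the content behind it, you can try http://api2.allwomenstalk.com/posts/category/hair?count=80&format=json. Because count has an upper limit, the key parameter is next_page. Fortunately, next_page comes back with every response; the returned JSON looks roughly like {"posts": array[123], "next_page": "123123123123"}. So to get more content, take the next_page from the previous response, encode it (presumably URL-encoding, as with the %3D%3D in the request above) and append it to the next request.

###### Thank you very much. Following your method I got what I needed. Next I'll parse the returned text into the objects I need, and the rest can be written like an ordinary crawler. Thanks again!

###### Is the OP planning to build a site just for looking at pretty girls?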
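The next_page pagination described in the answer above could be sketched roughly as follows. It rests on several assumptions: the response has the shape {"posts": [...], "next_page": "..."} given in the answer, the returned token only needs URL-encoding before being sent back, and the endpoint still behaves as it did at the time. `buildPageUrl`, `fetchAllPosts` and the injected `fetchJson` callback are hypothetical names, not part of the site's API.

```javascript
const API_BASE = 'http://api2.allwomenstalk.com/posts/category/hair';

// Build the URL for one page. URLSearchParams percent-encodes the token,
// so a trailing "==" is sent as "%3D%3D".
function buildPageUrl(nextPage, count = 40) {
  const params = new URLSearchParams({ count: String(count), format: 'json' });
  if (nextPage) params.set('next_page', nextPage);
  return `${API_BASE}?${params}`;
}

// Follow next_page from response to response, up to maxPages requests.
// fetchJson(url) must return a Promise of the parsed JSON body; it is
// injected so the loop can be tried without touching the network.
async function fetchAllPosts(fetchJson, maxPages = 5) {
  const posts = [];
  let nextPage = null;
  for (let page = 0; page < maxPages; page++) {
    const data = await fetchJson(buildPageUrl(nextPage));
    posts.push(...(data.posts || []));
    if (!data.next_page) break; // no further pages advertised
    nextPage = data.next_page;
  }
  return posts;
}

// Possible usage (Node 18+ provides a global fetch):
// fetchAllPosts(url => fetch(url).then(res => res.json()))
//   .then(posts => console.log(posts.length));
```

Keeping a page cap such as maxPages is a deliberate choice here: it stops the loop from running indefinitely if the server keeps returning a next_page value.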

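###### Why on earth read the source at all? Open the Chrome console, load a few pages, look at the URL parameters, and you're done. What is there to analyze in the source code? Don't tell me you've been developing in IE this whole time.

###### To scrape data returned by AJAX, you can refer to this article: http://doc.shenjianshou.cn/developmentSkills/useAJAX.html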
