Fetching Sina and 51CTO blog post content with jsoup + HttpClient


Demo project download: RometePro.rar (UTF-8 encoded)

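The two JARs in brief

HttpClient (fetches the web page to be parsed)

HttpClient features

The main features of HttpClient are listed below; see the HttpClient home page for the full list.

  • Implements all HTTP methods (GET, POST, PUT, HEAD, etc.)

  • Supports automatic redirects

  • Supports HTTPS

  • Supports proxy servers, etc.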



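jsoup (a powerful HTML parser; it can also download pages, but its HTTP handling is less capable than HttpClient's)

  jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. (The author finds it much nicer to use than HTMLParser.)

jsoup's main features:

1. Parse HTML from a URL, a file, or a string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text.
jsoup is released under the MIT license, so it can be used in commercial projects.
The original post showed a diagram of jsoup's main class hierarchy here:

[Figure: jsoup main class hierarchy]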


Download HttpClient

Official site download

Network drive download

Download jsoup

Official site download

Network drive download



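First, the result

 Parsed content of a Sina blog post; original address: http://blog.sina.com.cn/s/blog_89cc52f20101d1sh.html


 The TextView shows the content with the following behavior:

 1. URLs in the text are turned into links automatically (android:autoLink="all")

 2. A link can be selected the way text is selected in an EditText (drag to select the address), and a long press brings up the copy dialog

 3. Tapping a link opens it in the browser

[Figure: screenshot of the parsed Sina blog post]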



Fetching and parsing a Sina blog post

Add the following permissions to AndroidManifest.xml

<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />

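The layout uses a ScrollView

1) Enable automatic hyperlinks in the TextView with android:autoLink="all"


2) android:fadingEdge="vertical" (optional)

Sets which edges fade while scrolling: none (no fading), horizontal (left and right edges fade), vertical (top and bottom edges fade). On API level 14 and higher this attribute is ignored; android:requiresFadingEdge is its replacement.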


<?xml version="1.0" encoding="utf-8"?>
<ScrollView xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:background="@drawable/app_choose_btn_normalbg"
    android:fadingEdge="vertical"
    android:scrollbars="vertical">

    <LinearLayout
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:orientation="vertical">

        <LinearLayout
            android:layout_width="match_parent"
            android:layout_height="wrap_content"
            android:padding="5dp"
            android:orientation="horizontal"
            android:background="@drawable/grid_pictures_gdbg">

            <ImageView
                android:id="@+id/remote_searchhome"
                android:layout_width="40dp"
                android:layout_height="40dp"
                android:src="@drawable/remote_search_home" />

            <EditText
                android:id="@+id/remote_searedit"
                android:layout_width="0dp"
                android:layout_height="40dp"
                android:layout_weight="1"
                android:singleLine="true" />

            <ImageView
                android:id="@+id/remote_searchbtn"
                android:layout_width="40dp"
                android:layout_height="40dp"
                android:src="@drawable/search_btn_icon" />
        </LinearLayout>

        <TextView
            android:id="@+id/remotetext"
            android:layout_height="match_parent"
            android:layout_width="match_parent"
            android:gravity="top|left"
            android:background="@drawable/backmain_bg"
            android:textColor="@color/red"
            android:autoLink="all" />
    </LinearLayout>
</ScrollView>


Add the HttpClient and jsoup JARs to libs (just drag them into the libs folder)

[Figure: project libs folder containing the two JARs]


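Writing the Java files

The Sina example post is http://blog.sina.com.cn/s/blog_89cc52f20101d1sh.html

The 51CTO example post is 《两年来的IT资源汇总》 ("A two-year roundup of IT resources"), at http://7071976.blog.51cto.com/7061976/1289909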


Getting the page content

The key to extracting the post body: jsoup has a method that selects elements by their CSS class name

Document myDocument = Jsoup.parse(str);
Elements links = myDocument.getElementsByClass(divclass);


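Inspecting the HTML of a Sina blog page shows that the post body sits in an element whose class is articalContent (spelled that way in the page source):

[Figure: page source showing the articalContent element]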


Page fetching and post extraction: MySelfHttpClient.java

import java.io.IOException;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MySelfHttpClient {
    //String divclass = "showContent"; // 51CTO blog post body
    String divclass = "articalContent"; // Sina blog post body

    public MySelfHttpClient() {
    }

    /**
     * Downloads the complete page.
     *
     * @param link    page URL
     * @param charSet character encoding of the page
     * @return the page HTML, "Request error" on a non-200 status,
     *         or an empty string if the request threw an exception
     */
    public String getStringFromLink(String link, String charSet) {
        String str = "";
        HttpGet request = new HttpGet(link);
        HttpClient httpClient = new DefaultHttpClient();
        try {
            HttpResponse response = httpClient.execute(request);
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                str = EntityUtils.toString(response.getEntity(), charSet);
            } else {
                str = "Request error";
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the connections held by this client instance
            httpClient.getConnectionManager().shutdown();
        }
        return str;
    }

    /**
     * Extracts the text of every element whose class is divclass.
     *
     * @param str the page HTML
     * @return the extracted post text
     */
    public String getContent(String str) {
        StringBuilder content = new StringBuilder();
        Document myDocument = Jsoup.parse(str);
        Elements links = myDocument.getElementsByClass(divclass);
        //Log.d("str", links.toString());
        for (Element link : links) {
            content.append(link.text());
        }
        return content.toString();
    }
}
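
DefaultHttpClient is deprecated in later HttpClient 4.x releases, and the Apache HTTP classes were dropped from the Android SDK starting with API level 23, so the class above may not compile against a recent SDK without extra setup. As a rough sketch of the same fetch step using only java.net, something like the following could stand in for getStringFromLink. The class name PageFetcher, its method names, and the 10-second timeouts are illustrative assumptions, not part of the original demo.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;

/** Hypothetical stand-in for MySelfHttpClient.getStringFromLink, using only java.net. */
final class PageFetcher {

    /** Reads the whole stream and decodes it with the named charset (e.g. "utf-8", "gbk"). */
    static String readAll(InputStream in, String charSet) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), Charset.forName(charSet));
    }

    /**
     * Fetches the page body, or returns null when the server does not answer 200 OK.
     * Throws IllegalArgumentException if link is not an absolute URL.
     */
    static String fetch(String link, String charSet) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(link).toURL().openConnection();
        conn.setConnectTimeout(10_000); // assumed timeout, tune as needed
        conn.setReadTimeout(10_000);
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return null;
            }
            try (InputStream in = conn.getInputStream()) {
                return readAll(in, charSet);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Returning null for a non-200 status, instead of an error string, keeps the error message from being passed on to getContent as if it were HTML. Like the original method, fetch must be called off the main thread on Android.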


 Checking whether the device is online

Network check: ConnectionDetector.java

import android.content.Context;
import android.net.ConnectivityManager;
import android.net.NetworkInfo;

public class ConnectionDetector {

    private Context _context;

    public ConnectionDetector(Context context) {
        this._context = context;
    }

    /**
     * Checks whether any network interface is connected.
     *
     * @return true if at least one network reports CONNECTED, false otherwise
     */
    public boolean isConnectingToInternet() {
        ConnectivityManager connectivity =
                (ConnectivityManager) _context.getSystemService(Context.CONNECTIVITY_SERVICE);
        if (connectivity != null) {
            NetworkInfo[] info = connectivity.getAllNetworkInfo();
            if (info != null) {
                for (int i = 0; i < info.length; i++) {
                    if (info[i].getState() == NetworkInfo.State.CONNECTED) {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}


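The main Java implementation

  Key point 1: determine the encoding of the page you want to parse. The author did not find an encoding declaration in either the Sina or the 51CTO pages, but most pages are UTF-8 or GBK.

 Key point 2: make the TextView behave like an EditText, so a link can be long-pressed and copied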


/**************************/
// Let the TextView select and copy link text the way an EditText does
remoteText.setFocusableInTouchMode(true);
remoteText.setFocusable(true);
remoteText.setClickable(true);
remoteText.setLongClickable(true);
remoteText.setMovementMethod(ArrowKeyMovementMethod.getInstance());
/**************************/


 Main implementation: RemoteText.java

package com.remote;

import com.remotepro.R;
import android.app.Activity;
import android.app.ProgressDialog;
import android.os.Bundle;
import android.os.Handler;
import android.os.Message;
import android.text.method.ArrowKeyMovementMethod;
import android.view.View;
import android.view.Window;
import android.view.View.OnClickListener;
import android.widget.EditText;
import android.widget.ImageView;
import android.widget.TextView;
import android.widget.Toast;

public class RemoteText extends Activity {
    TextView remoteText;
    EditText myEditText;
    ImageView mySearchBtn;
    ImageView myHomeBtn;
    MySelfHttpClient mySelfHttpClient;
    String link = "http://blog.sina.com.cn/s/blog_89cc52f20101d1sh.html"; // Sina blog
    String charSet = "utf-8"; // Sina blog
    //String link = "http://7071976.blog.51cto.com/7061976/1289909";
    //String charSet = "gbk";
    String myText;
    // Prefix used to check the typed link; a Toast reports links that do not match
    String linktag = "http://blog.sina.com.cn"; // Sina as the example
    ConnectionDetector myConnectionDetector; // checks whether the device is online
    ProgressDialog myProgressDialog = null; // loading dialog

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        requestWindowFeature(Window.FEATURE_NO_TITLE);
        setContentView(R.layout.remotemain);
        init();
    }

    public void init() {
        remoteText = (TextView) findViewById(R.id.remotetext);
        myEditText = (EditText) findViewById(R.id.remote_searedit);
        mySearchBtn = (ImageView) findViewById(R.id.remote_searchbtn);
        myHomeBtn = (ImageView) findViewById(R.id.remote_searchhome);
        mySelfHttpClient = new MySelfHttpClient();
        myConnectionDetector = new ConnectionDetector(this);
        mySearchBtn.setOnClickListener(mySearcClick);
        myHomeBtn.setOnClickListener(myHomeClckListener);
        initText();
    }

    /***************************/
    // Load the default post; network access must happen off the UI thread
    public void initText() {
        if (myConnectionDetector.isConnectingToInternet()) {
            myProgressDialog = ProgressDialog.show(this, getString(R.string.waiting),
                    getResources().getString(R.string.loading));
            new InitTextThead().start();
        }
    }

    class InitTextThead extends Thread {
        @Override
        public void run() {
            super.run();
            // Fetch the page and extract the post body
            myText = mySelfHttpClient.getContent(mySelfHttpClient.getStringFromLink(link, charSet));
            myHandler.sendEmptyMessage(1);
        }
    }

    // Receives results from the worker threads and updates the UI
    Handler myHandler = new Handler() {
        @Override
        public void handleMessage(Message msg) {
            super.handleMessage(msg);
            switch (msg.what) {
            case 1: // default post loaded
            case 2: // post from the typed link loaded
                /**************************/
                // Let the TextView select and copy link text the way an EditText does
                remoteText.setFocusableInTouchMode(true);
                remoteText.setFocusable(true);
                remoteText.setClickable(true);
                remoteText.setLongClickable(true);
                remoteText.setMovementMethod(ArrowKeyMovementMethod.getInstance());
                /**************************/
                remoteText.setText(myText);
                myProgressDialog.dismiss();
                break;
            case 3: // typed link does not start with linktag
                myProgressDialog.dismiss();
                Toast.makeText(RemoteText.this, R.string.errorlingaddr, Toast.LENGTH_LONG).show();
                break;
            default:
                break;
            }
        }
    };

    /********************************/
    OnClickListener mySearcClick = new OnClickListener() {
        @Override
        public void onClick(View v) {
            searchclick();
        }
    };

    public void searchclick() {
        if (myConnectionDetector.isConnectingToInternet()) {
            myProgressDialog = ProgressDialog.show(this, getResources().getString(R.string.waiting),
                    getResources().getString(R.string.loading));
            new SearchThread().start();
        }
    }

    class SearchThread extends Thread {
        @Override
        public void run() {
            super.run();
            String link = myEditText.getText().toString();
            if (link.startsWith(linktag)) {
                myText = mySelfHttpClient.getContent(mySelfHttpClient.getStringFromLink(link, charSet));
                myHandler.sendEmptyMessage(2);
            } else {
                myHandler.sendEmptyMessage(3);
            }
        }
    }

    /********************************/
    OnClickListener myHomeClckListener = new OnClickListener() {
        @Override
        public void onClick(View v) {
            initText();
        }
    };
}
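
The activity above hands results from the worker thread to the UI through a shared myText field and numeric message codes. One possible alternative, sketched here with plain java.util.concurrent, passes the result straight to a callback that runs on a caller-supplied executor. BackgroundLoader and its method names are hypothetical. On Android the uiExecutor would need to post to the main thread, for example by wrapping a Handler created with Looper.getMainLooper(); that wiring is an assumption and is not shown.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical helper: runs a task on a worker thread and reports back on uiExecutor. */
final class BackgroundLoader {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Executor uiExecutor;

    BackgroundLoader(Executor uiExecutor) {
        this.uiExecutor = uiExecutor;
    }

    /** Runs task off the calling thread; exactly one of the two callbacks is invoked. */
    <T> void load(Supplier<T> task, Consumer<T> onSuccess, Consumer<Exception> onError) {
        worker.execute(() -> {
            final T result;
            try {
                result = task.get();
            } catch (RuntimeException e) {
                uiExecutor.execute(() -> onError.accept(e));
                return;
            }
            uiExecutor.execute(() -> onSuccess.accept(result));
        });
    }

    /** Stops the worker thread once queued tasks have finished. */
    void shutdown() {
        worker.shutdown();
    }
}
```

With this shape, the success callback would call remoteText.setText(...) and dismiss the dialog, and the error callback would show the Toast, so the message codes 1, 2, and 3 are no longer needed.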


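Parsing a 51CTO blog post

 Swap which lines are commented out in MySelfHttpClient.java and RemoteText.java


In MySelfHttpClient.java, enable the 51CTO class name

String divclass = "showContent"; // 51CTO blog post body
//String divclass = "articalContent"; // Sina blog post body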



In RemoteText.java, enable the 51CTO link and charset; 51CTO pages are encoded as GBK

//String link = "http://blog.sina.com.cn/s/blog_89cc52f20101d1sh.html"; // Sina blog
//String charSet = "utf-8"; // Sina blog
String link = "http://7071976.blog.51cto.com/7061976/1289909";
String charSet = "gbk";
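
Hard-coding the charset per site works for these two blogs, but pages often declare their encoding either in the Content-Type response header or in a meta tag near the top of the HTML. A minimal sketch of looking for that declaration, with a fallback when none is present, might look like this. CharsetSniffer and its methods are hypothetical names. The sketch assumes the caller passes in the header value and the first part of the page decoded with an ASCII-compatible charset such as ISO-8859-1.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that looks for a declared charset before falling back to a default. */
final class CharsetSniffer {
    // Matches charset=xxx in a Content-Type header or a <meta> tag, with optional quotes.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first charset named in text (lower-cased), or fallback when none is found. */
    static String find(String text, String fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : fallback;
    }

    /** The header wins over the HTML head; fallback is used when neither declares a charset. */
    static String detect(String contentTypeHeader, String htmlHead, String fallback) {
        String fromHeader = find(contentTypeHeader, null);
        if (fromHeader != null) {
            return fromHeader;
        }
        return find(htmlHead, fallback);
    }
}
```

The returned name can then be passed as the charSet argument when decoding the response body. The regular expression is deliberately loose and would also match the word charset inside ordinary page text, which is why only the head of the document should be passed in.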



 To load links typed into the app's EditText, also change linktag in RemoteText.java

RemoteText.java

String linktag = "http://blog.sina.com.cn"; // checks whether the link in the EditText is acceptable; Sina is the example here

class SearchThread extends Thread {
    @Override
    public void run() {
        super.run();
        String link = myEditText.getText().toString();
        if (link.startsWith(linktag)) {
            myText = mySelfHttpClient.getContent(mySelfHttpClient.getStringFromLink(link, charSet));
            myHandler.sendEmptyMessage(2);
        } else {
            myHandler.sendEmptyMessage(3);
        }
    }
}
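
The startsWith(linktag) check accepts only one site at a time, rejects https links, and would also accept a host such as blog.sina.com.cn.evil.example. As a sketch of a stricter check, the link can be parsed with java.net.URI and matched by host, which also lets the class name and charset be chosen per site instead of by editing comments. BlogLinkCheck and the Site enum are hypothetical names; the class names and charsets are the ones given in this article.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical replacement for the linktag prefix check. */
final class BlogLinkCheck {

    /** Sites handled in this article, with the post-body class and charset the text gives. */
    enum Site {
        SINA("articalContent", "utf-8"),
        CTO51("showContent", "gbk"),
        UNKNOWN(null, null);

        final String divClass;
        final String charSet;

        Site(String divClass, String charSet) {
            this.divClass = divClass;
            this.charSet = charSet;
        }
    }

    /** Classifies a typed link by scheme and host; anything unparseable is UNKNOWN. */
    static Site classify(String link) {
        if (link == null) {
            return Site.UNKNOWN;
        }
        final URI uri;
        try {
            uri = new URI(link.trim());
        } catch (URISyntaxException e) {
            return Site.UNKNOWN;
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return Site.UNKNOWN;
        }
        if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
            return Site.UNKNOWN;
        }
        host = host.toLowerCase(Locale.ROOT);
        if (host.equals("blog.sina.com.cn")) {
            return Site.SINA;
        }
        // 51CTO links in this article use a per-user subdomain, e.g. 7071976.blog.51cto.com
        if (host.equals("blog.51cto.com") || host.endsWith(".blog.51cto.com")) {
            return Site.CTO51;
        }
        return Site.UNKNOWN;
    }
}
```

In SearchThread, the result of classify(link) could replace the startsWith test: UNKNOWN maps to message 3, and for the other values site.divClass and site.charSet feed the fetch and extract calls.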


Parsing the post at http://7071976.blog.51cto.com/7061976/1289909 gives the following result

[Figure: screenshot of the parsed 51CTO blog post]



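Summary: using blog posts as the example, this article fetches a page with HttpClient and extracts the post body with jsoup. The text shown in the TextView looks somewhat jumbled, likely because Element.text() returns plain text without the page's paragraph structure, but this can be improved.

Where else this applies: take the Jinggangshan University library management system. The university licenses it from an outside company. Normally, building a client for it would require a backend endpoint that exposes the database for searching, but the vendor offers no such service. Instead, an Android client can use HttpClient to log in through the web pages and parse them to implement search, renewal, and similar functions.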



This article is reposted from lilin9105's 51CTO blog. Original link: http://blog.51cto.com/7071976/1297327. To repost it, please contact the original author.