Continuing from the previous installment:
Example 3: Data extraction
Requirement: from a piece of HTML code, extract all email addresses and the link addresses in <a href...> tags.
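// HtmlTest.java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlTest {
    public static void main(String[] args) {
        // Sample HTML: five <a> tags whose href values are quoted in different ways,
        // followed by a bare email address split across two string literals.
        String htmlText = "<html>"
                + "<a href=\"testone@163.com\">163test</a>\n"
                + "<a href='www.163.com@163-com.com'>163news</a>\n"
                + "<a href=http://www.163.com>163lady</a>\n"
                + "<a href = http://sports.163.com>NetEase Sports</a>\n"
                + "<a href = \"http://gz.house.163.com\">NetEase Real Estate</a>\n"
                + ".leemaster@163"
                + "luckdog.com"
                + "</html>";

        System.out.println("Checking for emails");
        for (String email : extractEmail(htmlText)) {
            System.out.println("Email: " + email);
        }
        System.out.println("Checking for hyperlinks");
        for (String link : extractLink(htmlText)) {
            System.out.println("Hyperlink: " + link);
        }
    }

    // Collects every match of HREF_LINK_REGEX. Note that m.group() returns the whole
    // match (from "<a" through the delimiter after the address), not only the address.
    private static List<String> extractLink(String htmlText) {
        List<String> result = new ArrayList<String>();
        Pattern p = Pattern.compile(Regexes.HREF_LINK_REGEX);
        Matcher m = p.matcher(htmlText);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Collects every match of EMAIL_REGEX in the text.
    private static List<String> extractEmail(String htmlText) {
        List<String> result = new ArrayList<String>();
        Pattern p = Pattern.compile(Regexes.EMAIL_REGEX);
        Matcher m = p.matcher(htmlText);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}

// Regexes.java
public class Regexes {
    // (?i) makes the match case-insensitive.
    // Local part: starts and ends with a letter or digit, with letters, digits,
    // '-', '_' or '.' in between, so it is at least three characters long.
    // Domain: one or more dot-terminated labels, then a top-level domain of 2 to 4 letters.
    // (?<=\b) and (?=\b) are zero-width and behave the same as a plain \b.
    public static final String EMAIL_REGEX =
            "(?i)(?<=\\b)[a-z0-9][-a-z0-9_.]+[a-z0-9]@([a-z0-9][-a-z0-9]+\\.)+[a-z]{2,4}(?=\\b)";

    // "<a", whitespace, "href", "=" with optional whitespace around it, an optional
    // opening quote, then group 1 (the address: everything up to a quote, whitespace
    // or '>'), then exactly one closing delimiter.
    public static final String HREF_LINK_REGEX =
            "(?i)<a\\s+href\\s*=\\s*['\"]?([^'\"\\s>]+)['\"\\s>]";
}

Output:
Checking for emails
Email: testone@163.com
Email: www.163.com@163-com.com
Email: leemaster@163luckdog.com
Checking for hyperlinks
Hyperlink: <a href="testone@163.com"
Hyperlink: <a href='www.163.com@163-com.com'
Hyperlink: <a href=http://www.163.com>
Hyperlink: <a href = http://sports.163.com>
Hyperlink: <a href = "http://gz.house.163.com"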
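As the hyperlink lines in the output show, m.group() returns the entire match, including the leading "<a href=" text and the closing delimiter. If only the link address is wanted, capture group 1 of the same pattern already holds it. The following is a minimal sketch of that idea, not part of the original listing: the class name LinkOnly and the method name extractLinkTargets are hypothetical, and the pattern string is copied verbatim from HREF_LINK_REGEX so the example stands alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this sketch.
class LinkOnly {
    // Same pattern string as Regexes.HREF_LINK_REGEX, repeated here for self-containment.
    static final String HREF_LINK_REGEX =
            "(?i)<a\\s+href\\s*=\\s*['\"]?([^'\"\\s>]+)['\"\\s>]";

    // Returns only the href values (capture group 1), without the surrounding tag text.
    static List<String> extractLinkTargets(String htmlText) {
        List<String> result = new ArrayList<String>();
        Matcher m = Pattern.compile(HREF_LINK_REGEX).matcher(htmlText);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"testone@163.com\">163test</a>\n"
                + "<a href = http://sports.163.com>sports</a>\n";
        for (String target : extractLinkTargets(html)) {
            System.out.println(target);
        }
    }
}
```

Keeping the pattern unchanged and switching from group() to group(1) means the matching behaviour stays exactly as before; only the part of each match that gets reported differs.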
Example 4: Finding duplicate words
Requirement: check whether a piece of text contains repeated words and, if it does, remove the duplicates.
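// FindWord.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindWord {
    public static void main(String[] args) {
        String[] sentences = new String[] {
                "this is a normal sentence",
                "Oh,my god!Duplicate word word",
                "This sentence contain no duplicate word words"
        };
        for (String sentence : sentences) {
            System.out.println("Checking sentence: " + sentence);
            if (containDupWord(sentence)) {
                System.out.println("Duplicate word found!!");
                System.out.println("Removing duplicate words: " + removeDupWords(sentence));
            }
            System.out.println("");
        }
    }

    // Replaces each "word word" pair with a single copy of the word (group 1).
    private static String removeDupWords(String sentence) {
        String regex = Regexes.DUP_WORD_REGEX;
        return sentence.replaceAll(regex, "$1");
    }

    // Returns true if the sentence contains at least one immediately repeated word.
    private static boolean containDupWord(String sentence) {
        String regex = Regexes.DUP_WORD_REGEX;
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(sentence);
        return m.find();
    }
}

// Regexes.java (only the constant used in this example is shown)
public class Regexes {
    // (\w+) captures a word, \s+ matches the whitespace after it, and \1 requires
    // the same word again. The boundaries on both sides prevent partial matches
    // such as the "is" inside "this" followed by "is", or "word" followed by "words".
    public static final String DUP_WORD_REGEX = "(?<=\\b)(\\w+)\\s+\\1(?=\\b)";
}

Output:
Checking sentence: this is a normal sentence

Checking sentence: Oh,my god!Duplicate word word
Duplicate word found!!
Removing duplicate words: Oh,my god!Duplicate word

Checking sentence: This sentence contain no duplicate word words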
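One limitation is worth noting. Each match of DUP_WORD_REGEX covers exactly two occurrences of a word, so replaceAll turns a triple such as "word word word" into "word word" and leaves one duplicate behind. A possible variation, sketched here under the assumption that any run of repeats should collapse to a single word, quantifies the repeated part. The class name DedupeRuns, the method collapseRepeats, and the pattern DUP_RUN_REGEX are hypothetical and do not come from the original post.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this sketch.
class DedupeRuns {
    // \b(\w+) captures a whole word; (?:\s+\1\b)+ then consumes one or more
    // further whole-word repeats of that same word.
    static final String DUP_RUN_REGEX = "\\b(\\w+)(?:\\s+\\1\\b)+";
    private static final Pattern DUP_RUN = Pattern.compile(DUP_RUN_REGEX);

    // Collapses every run of an immediately repeated word into one copy of it.
    static String collapseRepeats(String sentence) {
        return DUP_RUN.matcher(sentence).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapseRepeats("Oh,my god!Duplicate word word word"));
        System.out.println(collapseRepeats("This sentence contain no duplicate word words"));
    }
}
```

The \b after \1 plays the same role as the trailing (?=\b) in DUP_WORD_REGEX: it keeps "word words" from being treated as a repeat.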
To be continued...
This article is reposted from jooben's 51CTO blog. Original link: http://blog.51cto.com/jooben/316592