Scraping NetEase News with Django + Python

Install beautifulsoup4
[shell]pip install beautifulsoup4[/shell]
Create a new project named lzyone
[shell]django-admin startproject lzyone[/shell]
Change into the lzyone directory and create a new app named news
[shell]python manage.py startapp news[/shell]
lzyone/lzyone/urls.py
[python]
from django.conf.urls import url
from django.contrib import admin
from news import views

urlpatterns = [
    url(r'^admin/', admin.site.urls),
    url(r'^$', views.news),
]
[/python]
In the lzyone directory, create a templates directory, then inside lzyone/templates create the template file news.htm
[html]
{% load static %}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>News / Article List</title>
<link rel="stylesheet" href="{% static 'css/style.css' %}" type="text/css" />
</head>
<body>
<dl class="list_dl">
<dt>
<b>Recommended</b>
<a href="#" class="more">&gt;&gt;More</a>
</dt>
<dd>
<ul>
{% for athlete in athlete_list %}
<li><span>{{ athlete.ptime }}</span> <a href="{{ athlete.url }}">{{ athlete.title }}</a></li>
{% endfor %}
</ul>
</dd>
</dl>
<footer>
&copy; 2018 lzy.one
<script type="text/javascript">var cnzz_protocol = (("https:" == document.location.protocol) ? " https://" : " http://");document.write(unescape("%3Cspan id='cnzz_stat_icon_1272888821'%3E%3C/span%3E%3Cscript src='" + cnzz_protocol + "s22.cnzz.com/stat.php%3Fid%3D1272888821%26show%3Dpic' type='text/javascript'%3E%3C/script%3E"));</script>
</footer>
</body>
</html>
[/html]
style.css
[css]
ul,li,ol,dl,dt,dd {
padding:0px;
margin:0px;
list-style-type: none;
}
a {
text-decoration: none;
}
.list_dl {
width:100%;
height:auto;
display:block;
overflow:hidden;
margin-bottom:8px;
font-size:10pt;
}
.list_dl dt {
width:100%;
height:24px;
margin-bottom:1px;
background-color:#003366;
border-bottom-width: 2px;
border-bottom-style: solid;
border-bottom-color: #FF9933;
background-image: url(images/right_tit.gif);
background-repeat:repeat-x;
background-position: right top;
}
.list_dl dt b {
float:left;
width:240px;
height:24px;
line-height:24px;
display:block;
color:#FFFFFF;
margin-left:12px;
}
.list_dl dt a {
width:5em;
height:23px;
display:block;
line-height:23px;
margin-top:1px;
color:#FFFFFF;
float:right;
text-align:right;
padding-right:10px;
}
.list_dl dt a.more {
color:#C1CEDB;
}
.list_dl dt a.more:hover {
color:#fff;
}
.list_dl dd {
display:block;
margin-top:4px;
clear:both;
}
.list_dl ul li {
text-align:left;
text-indent: 1.3em;
line-height:220%;
border-bottom-width: 1px;
border-bottom-style: dashed;
border-bottom-color: #CCCCCC;
background-image: url(images/list_ico.gif);
background-repeat:no-repeat;
background-position: 4px center;
}
a.link1 {
color:#797979;
}
.list_dl ul li span {
float:right;
color:#9B9B9B;
margin-right:7px;
}
[/css]
The template loads its stylesheet through Django's static files configuration. For reference, see:
[shell]http://blog.csdn.net/sinat_21302587/article/details/74059078[/shell]
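For convenience, here is a minimal sketch of the settings this setup seems to rely on. It assumes the file is lzyone/lzyone/settings.py, that the templates directory is lzyone/templates as created above, and that style.css is stored at lzyone/static/css/style.css (the static directory location is an assumption; the post does not say where the file lives). Only the relevant settings are shown.

```python
import os

# Assumed location: lzyone/lzyone/settings.py, so BASE_DIR is the outer lzyone directory.
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

TEMPLATES = [
    {
        'BACKEND': 'django.template.backends.django.DjangoTemplates',
        # Project-level templates directory that holds news.htm.
        'DIRS': [os.path.join(BASE_DIR, 'templates')],
        'APP_DIRS': True,
        'OPTIONS': {
            'context_processors': [
                'django.template.context_processors.debug',
                'django.template.context_processors.request',
                'django.contrib.auth.context_processors.auth',
                'django.contrib.messages.context_processors.messages',
            ],
        },
    },
]

# URL prefix used by the {% static %} template tag.
STATIC_URL = '/static/'

# Assumed project-level static directory containing css/style.css.
STATICFILES_DIRS = [os.path.join(BASE_DIR, 'static')]
```

With these values, `{% static 'css/style.css' %}` should resolve to /static/css/style.css when the development server runs with DEBUG enabled.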
NetEase News URLs

Recommended: http://3g.163.com/touch/article/list/BA8J7DG9wangning/20-20.html (the part to change is mainly the trailing 20-20)
News: http://3g.163.com/touch/article/list/BBM54PGAwangning/0-10.html
Entertainment: http://3g.163.com/touch/article/list/BA10TA81wangning/0-10.html
Sports: http://3g.163.com/touch/article/list/BA8E6OEOwangning/0-10.html
Finance: http://3g.163.com/touch/article/list/BA8EE5GMwangning/0-10.html
Fashion: http://3g.163.com/touch/article/list/BA8F6ICNwangning/0-10.html
Military: http://3g.163.com/touch/article/list/BAI67OGGwangning/0-10.html
Mobile phones: http://3g.163.com/touch/article/list/BAI6I0O5wangning/0-10.html
Technology: http://3g.163.com/touch/article/list/BA8D4A3Rwangning/0-10.html
Games: http://3g.163.com/touch/article/list/BAI6RHDKwangning/0-10.html
Digital: http://3g.163.com/touch/article/list/BAI6JOD9wangning/0-10.html
Education: http://3g.163.com/touch/article/list/BA8FF5PRwangning/0-10.html
Health: http://3g.163.com/touch/article/list/BDC4QSV3wangning/0-10.html
Autos: http://3g.163.com/touch/article/list/BA8DOPCSwangning/0-10.html
Home: http://3g.163.com/touch/article/list/BAI6P3NDwangning/0-10.html
Real estate: http://3g.163.com/touch/article/list/BAI6MTODwangning/0-10.html
Travel: http://3g.163.com/touch/article/list/BEO4GINLwangning/0-10.html
Parenting: http://3g.163.com/touch/article/list/BEO4PONRwangning/0-10.html
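
All of the URLs above follow one pattern: a channel ID, then a final segment made of two numbers. A small helper can build them, as in the sketch below. The helper name `build_list_url` and the parameter names `start` and `size` are hypothetical, and reading the two numbers as an offset and a page size is a guess that the source does not confirm.

```python
BASE = "http://3g.163.com/touch/article/list"

# A few channel IDs copied from the list above.
CHANNELS = {
    "recommended": "BA8J7DG9wangning",
    "news": "BBM54PGAwangning",
    "sports": "BA8E6OEOwangning",
}


def build_list_url(channel_id, start, size):
    """Build a list URL; `start` and `size` fill the final "N-M" segment.

    Their meaning (offset and page size) is an assumption.
    """
    return "{0}/{1}/{2}-{3}.html".format(BASE, channel_id, start, size)


print(build_list_url(CHANNELS["news"], 0, 10))
```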

Change into the news directory and edit the view, views.py
[python]# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import httplib
import json
import socket
import sys
import urllib2

from bs4 import BeautifulSoup  # installed above; not used by this view
from django.shortcuts import render

# Python 2 only: make utf-8 the default for implicit str/unicode conversions.
reload(sys)
sys.setdefaultencoding('utf-8')


def news(request):
    url = "http://3g.163.com/touch/article/list/BA8J7DG9wangning/20-20.html"
    context = {}
    context['athlete_list'] = getNews(url)
    return render(request, 'news.htm', context)


def getNews(url):
    request = urllib2.Request(url)
    # Send a desktop browser User-Agent with the request.
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/48.0.2564.116 Safari/537.36')
    try:
        body = urllib2.urlopen(request).read()
    except (socket.timeout, urllib2.URLError, httplib.BadStatusLine):
        # The request failed, so return an empty list and let the page render.
        return []
    # The response is JSONP of the form artiList({...}); strip the wrapper.
    hjson = json.loads(body.lstrip('artiList(').rstrip(')'))
    return hjson['BA8J7DG9wangning']
[/python]
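The view strips the `artiList(...)` wrapper with `lstrip` and `rstrip`, which remove sets of characters rather than an exact prefix or suffix. A somewhat stricter alternative is to match the wrapper explicitly, as in this sketch. The function name `parse_artilist` is hypothetical, and the sample payload is invented for illustration; only the field names `title`, `url` and `ptime` come from the template above.

```python
import json
import re

# Matches artiList( ... ) with optional surrounding whitespace and a trailing semicolon.
_JSONP_RE = re.compile(r'^\s*artiList\((.*)\)\s*;?\s*$', re.DOTALL)


def parse_artilist(text):
    """Return the JSON object wrapped in an artiList(...) JSONP response."""
    match = _JSONP_RE.match(text)
    if match is None:
        raise ValueError('response is not wrapped in artiList(...)')
    return json.loads(match.group(1))


# Invented sample payload, shaped like the fields the template reads.
sample = ('artiList({"BA8J7DG9wangning": [{"title": "demo", '
          '"url": "http://example.com/a", "ptime": "2018-01-02 03:04:05"}]})')

articles = parse_artilist(sample)['BA8J7DG9wangning']
print(articles[0]['title'])
```

Raising `ValueError` on an unexpected body makes a changed response format visible instead of failing later inside `json.loads`.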
Test:
http://lzy.one/

Formatting with Django filters
[html]
<ul>
{% for athlete in athlete_list %}
<li><span>{% if athlete.ptime|length > 10 %}
{{ athlete.ptime|slice:"5:10" }}
{% else %} {{ athlete.ptime }}
{% endif %}</span> <a href="{{ athlete.url }}">{{ athlete.title }}</a></li>
{% endfor %}
</ul>
[/html]
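To make the filter logic concrete, here is the same rule in plain Python: when `ptime` is longer than 10 characters, keep only characters 5 through 9. The sketch assumes `ptime` looks like "YYYY-MM-DD HH:MM:SS", which the source does not state, and `short_date` is a hypothetical helper name.

```python
def short_date(ptime):
    """Mirror the template: slice:"5:10" when the value is longer than 10 characters."""
    if len(ptime) > 10:
        return ptime[5:10]
    return ptime


# Assumed timestamp format; under it the slice keeps the month and day.
print(short_date("2018-01-02 03:04:05"))
```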