Structuring and Saving Data

1. Save the body text of each news article to a text file.


def writeNewsDetail(content):
    # Append the article body to the text file; the with-block closes the file.
    with open('gzccNews.txt', 'a', encoding='utf-8') as f:
        f.write(content)
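
Because the file is opened in append mode, successive article bodies run together unless a separator is written. The following is a minimal sketch of one way to handle that, assuming a blank line is an acceptable separator. The helper name `append_article` and the temporary demo file are hypothetical and not part of the original script.

```python
import os
import tempfile


def append_article(path, content):
    # Append one article body, followed by a blank line as a separator.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(content.rstrip('\n') + '\n\n')


# Demo on a temporary file so nothing in the working directory is touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'gzccNews_demo.txt')
append_article(demo_path, 'first article body')
append_article(demo_path, 'second article body')

with open(demo_path, encoding='utf-8') as f:
    saved = f.read()

# Splitting on the blank line recovers the individual articles.
articles = saved.strip().split('\n\n')
print(articles)
```

This only works cleanly if the article bodies themselves contain no blank lines; a different separator would be needed otherwise.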

　　

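2. Structure the news data as a list of dictionaries:
- Details of a single news item --> dictionary news
- All news items on one list page --> list newsList, built with newsList.append(news)
- All news items from all list pages --> list newsTotal, built with newsTotal.extend(newsList)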
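
The difference between append and extend in the steps above can be seen on hand-written data. This is a minimal sketch; the sample titles and click counts are made up for illustration, and the variable names are not taken from the script.

```python
# Made-up sample data standing in for two scraped list pages.
page_one = []
page_one.append({'title': 'news A', 'click': 10})  # append adds one dict
page_one.append({'title': 'news B', 'click': 20})

page_two = [{'title': 'news C', 'click': 30}]

news_total = []
for page in (page_one, page_two):
    # extend adds every dict of the page, so the result stays a flat list
    news_total.extend(page)

print(len(news_total))
print([news['title'] for news in news_total])
```

Using append in the loop instead would produce a list of lists, which pandas would not turn into one row per news item.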
import re
from datetime import datetime

import pandas
import requests
from bs4 import BeautifulSoup
  
def writeNewsDetail(content):
    # Append the article body to the text file; the with-block closes the file.
    with open('gzccNews.txt', 'a', encoding='utf-8') as f:
        f.write(content)
  
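def getClickCount(newsUrl):  # click count of a single news article
    # Take the text between '_' and '.html', then keep the part after the '/' as the article id.
    newId = re.search(r'_(.*)\.html', newsUrl).group(1).split('/')[1]
    clickUrl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newId)
    # The number sits in the tail of the response after '.html', wrapped as ('...');
    countText = requests.get(clickUrl).text.split('.html')[-1]
    return int(countText.lstrip("('").rstrip("');"))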
  
def getNewsDetail(newsUrl):  # all information for a single news article
    resd = requests.get(newsUrl)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')  # fetch and parse the article page

    news = {}
    news['title'] = soupd.select('.show-title')[0].text
    info = soupd.select('.show-info')[0].text
    # str.lstrip() strips a set of characters, not a prefix, so match the timestamp directly.
    stamp = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', info).group(0)
    news['dt'] = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    # Author (作者：), reviewer (审核：) and photographer (摄影：) can be handled the same way.
    source = re.search(r'来源：(\S+)', info)
    news['source'] = source.group(1) if source else 'none'
    news['content'] = soupd.select('.show-content')[0].text.strip()
    writeNewsDetail(news['content'])
    news['click'] = getClickCount(newsUrl)
    news['newsUrl'] = newsUrl
    return news
  
def getListPage(pageUrl):  # all news items on one list page
    res = requests.get(pageUrl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')

    newsList = []
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:
            newsUrl = news.select('a')[0].attrs['href']  # link to the article
            newsList.append(getNewsDetail(newsUrl))
    return newsList
  
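def getPageN():  # total number of list pages
    res = requests.get('http://news.gzcc.cn/html/xiaoyuanxinwen/')
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    n = int(soup.select('.a1')[0].text.rstrip('条'))  # total number of news items
    # Ceiling division at 10 items per page; n // 10 + 1 counts one page too many
    # when n is a multiple of 10.
    return (n + 9) // 10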
  
newsTotal = []
firstPageUrl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
newsTotal.extend(getListPage(firstPageUrl))

n = getPageN()
# Only the last list page is fetched here; widen the range to crawl more pages.
for i in range(n, n + 1):
    listPageUrl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    newsTotal.extend(getListPage(listPageUrl))

print(newsTotal)
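
The string handling in getNewsDetail and getClickCount can be checked without network access. The following is a minimal sketch that uses regular expressions for the info line and the same split/strip steps for the counter response. Both sample strings are hypothetical and written by hand, not captured from the site, and the helper names `parse_info` and `parse_click` are illustrative.

```python
import re
from datetime import datetime


def parse_info(info):
    # Pull the timestamp and the source out of an article's info line.
    stamp = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', info).group(0)
    source = re.search(r'来源：(\S+)', info)
    return {
        'dt': datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S'),
        'source': source.group(1) if source else 'none',
    }


def parse_click(response_text):
    # Keep what follows the last '.html' and strip the ('...'); wrapper.
    tail = response_text.split('.html')[-1]
    return int(tail.lstrip("('").rstrip("');"))


# Hypothetical samples, written by hand for this demo.
sample_info = '发布时间:2018-04-01 11:57:00 作者：张三 来源：学校综合办'
sample_response = "$('#hits').html('3154');"

parsed = parse_info(sample_info)
clicks = parse_click(sample_response)
print(parsed['dt'], parsed['source'], clicks)
```

Matching with a pattern avoids a pitfall of str.lstrip, which removes a set of characters rather than a prefix: '来源：来宾'.lstrip('来源：') returns '宾'.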
  
3. Install pandas and create a DataFrame object df with pandas.DataFrame(newsTotal).

df = pandas.DataFrame(newsTotal)
print(df)
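
How a list of dictionaries maps onto a DataFrame can be seen on a small example. This is a minimal sketch with made-up records that reuse three of the keys the scraper produces; the names `sample_news` and `sample_df` are illustrative.

```python
import pandas

# Made-up records with some of the keys the scraper produces.
sample_news = [
    {'title': 'news A', 'source': '学校综合办', 'click': 3500},
    {'title': 'news B', 'source': '国际学院', 'click': 120},
]

sample_df = pandas.DataFrame(sample_news)
print(sample_df.shape)          # (rows, columns)
print(list(sample_df.columns))  # dictionary keys become column names
```

Each dictionary becomes one row, so every news item should carry the same set of keys; a missing key shows up as NaN in that row.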
  
4. Use df to save the extracted data to a CSV or Excel file.

df.to_excel('gzccnews0416.xlsx')
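
The step also mentions CSV as a target. The following is a minimal sketch using DataFrame.to_csv on made-up sample rows and a temporary output path. The utf-8-sig encoding is a choice made here so that spreadsheet programs are more likely to detect the encoding of the Chinese text. Note that to_excel additionally needs an Excel writer package such as openpyxl to be installed.

```python
import os
import tempfile

import pandas

# Made-up rows standing in for the scraped data.
sample_df = pandas.DataFrame([
    {'title': 'news A', 'source': '学校综合办', 'click': 3500},
    {'title': 'news B', 'source': '国际学院', 'click': 120},
])

# Write to a temporary directory; index=False leaves out the row-number column.
csv_path = os.path.join(tempfile.mkdtemp(), 'gzccnews_demo.csv')
sample_df.to_csv(csv_path, index=False, encoding='utf-8-sig')

# Read the file back to confirm the round trip.
restored = pandas.read_csv(csv_path, encoding='utf-8-sig')
print(restored)
```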
  
  　　
  
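5. Analyze the data with pandas functions and methods:
- Extract the first 6 rows with the click count, title, and source columns.
- Extract the news published by '学校综合办' with more than 3000 clicks.
- Extract the news published by '国际学院' and '学生工作处'.

print(df[['click', 'title', 'source']].head(6))
print(df[(df['click'] > 3000) & (df['source'] == '学校综合办')])
sourcelist = ['学生工作处', '国际学院']
print(df[df['source'].isin(sourcelist)])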
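
The three selections can be tried on a small frame before running them on scraped data. This is a minimal sketch with made-up rows (including the source name '招生办', which is invented for the demo); the variable names are illustrative.

```python
import pandas

# Made-up rows; the column names match the dictionaries built by the scraper.
sample_df = pandas.DataFrame([
    {'title': 'news A', 'source': '学校综合办', 'click': 3500},
    {'title': 'news B', 'source': '学校综合办', 'click': 800},
    {'title': 'news C', 'source': '国际学院', 'click': 120},
    {'title': 'news D', 'source': '学生工作处', 'click': 4100},
    {'title': 'news E', 'source': '招生办', 'click': 60},
])

# Column subset plus head(): at most the first 6 rows.
top_rows = sample_df[['click', 'title', 'source']].head(6)

# Combine conditions with & and wrap each one in parentheses.
popular = sample_df[(sample_df['click'] > 3000) & (sample_df['source'] == '学校综合办')]

# isin() keeps rows whose source matches any entry in the list.
by_source = sample_df[sample_df['source'].isin(['学生工作处', '国际学院'])]

print(top_rows)
print(popular)
print(by_source)
```

The parentheses around each condition matter because & binds more tightly than the comparison operators in Python.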
Original article: https://www.cnblogs.com/dean666/p/8870232.html