[Crawler] Web Crawler for Novel Coronavirus Pneumonia Outbreak Data

唐宋丶

February 5, 2020


# Web Crawler for Novel Coronavirus Pneumonia Outbreak Data + Data Persistence + Email Notification

Data source: [DXY (丁香园)][1]
Reference project: [https://gitee.com/TicsmycL/nCoV_Crawler2019][2]

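* Four kinds of data: real-time news + nationwide summary (confirmed x cases, suspected x cases, deaths x cases, cured x cases) + per-province outbreak data + rumor rebuttals
* The database is MySQL; the table-creation statements are in the SQL directory
* Features:
  * Fetch the page with jsoup and filter the data with regular expressions
  * Persist the data with MyBatis + MySQL
  * Send an email notification when the data changes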

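The "notify only when the data changes" feature can be sketched as follows. This is a minimal in-memory illustration, not the project's code: `Totals`, `Notifier`, and `ChangeDetector` are hypothetical names, the comparison happens in memory rather than against MySQL, and `Notifier` is only a stand-in for a real mail sender.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical snapshot of the nationwide totals (names are assumptions). */
final class Totals {
    final int confirmed, suspected, dead, cured;

    Totals(int confirmed, int suspected, int dead, int cured) {
        this.confirmed = confirmed;
        this.suspected = suspected;
        this.dead = dead;
        this.cured = cured;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Totals)) return false;
        Totals t = (Totals) o;
        return confirmed == t.confirmed && suspected == t.suspected
                && dead == t.dead && cured == t.cured;
    }

    @Override
    public int hashCode() {
        return Objects.hash(confirmed, suspected, dead, cured);
    }
}

/** Stand-in for a real mail sender; an assumption for this sketch. */
interface Notifier {
    void send(String subject, String body);
}

final class ChangeDetector {
    private final Notifier notifier;
    private Totals last; // previous snapshot, null before the first run

    ChangeDetector(Notifier notifier) {
        this.notifier = notifier;
    }

    /** Stores the snapshot and notifies only when it differs from the previous one. */
    boolean accept(Totals current) {
        if (current.equals(last)) {
            return false; // nothing changed, so no notification
        }
        last = current;
        notifier.send("Outbreak data updated",
                "confirmed=" + current.confirmed + ", suspected=" + current.suspected
                        + ", dead=" + current.dead + ", cured=" + current.cured);
        return true;
    }
}

final class ChangeDetectorDemo {
    public static void main(String[] args) {
        List<String> outbox = new ArrayList<>();
        ChangeDetector detector = new ChangeDetector((subject, body) -> outbox.add(body));
        detector.accept(new Totals(100, 50, 2, 10)); // first snapshot: notifies
        detector.accept(new Totals(100, 50, 2, 10)); // unchanged: silent
        detector.accept(new Totals(120, 60, 3, 12)); // changed: notifies
        System.out.println(outbox.size() + " notifications sent");
    }
}
```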
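## Technologies used in this project:

1. Logging with Logger
2. MVC design pattern
3. MyBatis 3.0 native SqlSession, with the factory design pattern and transactional commits
4. Web-crawler data fetching with Jsoup, and related conventions
5. Regex matching of the target data with `java.util.regex.Matcher` and `java.util.regex.Pattern`
6. Building a JSON parser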

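As an illustration of the regex-matching step, here is a minimal sketch using only `java.util.regex`. It assumes the page embeds each JSON payload as `window.<id> = ...` inside a `<script id="<id>">` tag wrapped in `try { ... }catch(e){}`; that layout, the class name `ScriptJsonExtractor`, and the sample values are assumptions made for the example, not the project's actual templates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptJsonExtractor {
    /**
     * Returns the JSON assigned to window.<id> inside <script id="<id>">,
     * or null when no such script tag is found.
     */
    static String extract(String html, String id) {
        String q = Pattern.quote(id); // treat the id as a literal, not as regex syntax
        Pattern p = Pattern.compile(
                "<script[^>]*\\bid=\"" + q + "\"[^>]*>.*?window\\." + q
                        + "\\s*=\\s*(.*?)\\}catch\\(e\\)\\{\\}</script>",
                Pattern.DOTALL); // DOTALL lets '.' span line breaks inside the script
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample page fragment; the numbers are illustrative only.
        String html = "<script id=\"getStatisticsService\">try { "
                + "window.getStatisticsService = {\"confirmedCount\":100,\"curedCount\":10}"
                + "}catch(e){}</script>";
        System.out.println(extract(html, "getStatisticsService"));
    }
}
```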
---

## Main content

```Java
import java.util.List;

import org.apache.log4j.Logger; // assumed: log4j 1.x, which provides Logger.getLogger(Class)

public class InformationService {
    private final Logger logger = Logger.getLogger(InformationService.class);

    private InformationDao informationDao;

    public InformationService() {
        this.informationDao = new InformationDao();
    }

    public void getNews() {
        // Fetch the HTML page
        Tools.getPageByJSoup(Crawler.URL);

        // Raw JSON string of the statistics block
        String staticInformation = null;
        // Parsed statistics object
        Statistics statisticsInformation = null;

        try {
            staticInformation = Tools.getInformation(Crawler.STATIC_INFORMATION_REGEX_TEMPLATE_1, "id", Crawler.STATIC_INFORMATION_ATTRIBUTE);
            statisticsInformation = Parse.parseStatisticsInformation(staticInformation);
        } catch (NullPointerException e) {
            // Regex 1 found nothing, so retry with the second template
            logger.error("Statistics regex 1 failed to match; switching to regex 2");
            staticInformation = Tools.getInformation(Crawler.STATIC_INFORMATION_REGEX_TEMPLATE_2, "id", Crawler.STATIC_INFORMATION_ATTRIBUTE);
            statisticsInformation = Parse.parseStatisticsInformation(staticInformation);
        }

        // Data persistence
        String timeLineNews = null;
        String provinceNews = null;
        String rymorNews = null;

        // A non-null result means the nationwide totals have changed
        String statisticsNews = informationDao.insertStatistics(statisticsInformation);
        if (statisticsNews != null) {
            // Totals changed, so refresh the remaining data as well.
            // Extract the JSON strings of the other blocks
            String timelineServiceInformation = Tools.getInformation(Crawler.TIME_LINE_REGEX_TEMPLATE, "id", Crawler.TIME_LINE_ATTRIBUTE);
            String areaInformation = Tools.getInformation(Crawler.AREA_INFORMATION_REGEX_TEMPLATE, "id", Crawler.AREA_INFORMATION_ATTRIBUTE);
            String rymorListInformation = Tools.getInformation(Crawler.RYMOR_INFORMATION_REGEX_TEMPLATE, "id", Crawler.RYMOR_INFORMATION_ATTRIBUTE);

            // Parse the JSON strings into Java objects
            List<TimeLine> timeLineList = Parse.parseTimeLineInformation(timelineServiceInformation);
            List<AreaStat> areaStatList = Parse.parseAreaInformation(areaInformation);
            logger.debug("rumor JSON: " + rymorListInformation);
            List<RymorList> rymorLists = Parse.parseRymorInformation(rymorListInformation);

            timeLineNews = informationDao.insertTimeLine(timeLineList);
            provinceNews = informationDao.insertProvince(areaStatList);
            rymorNews = informationDao.inserRymorList(rymorLists);

            // Email notification (currently disabled)
            // sendEmail(timeLineNews, provinceNews, statisticsNews);
        }

        // Release the DAO's resources
        informationDao.destory();
    }
}
```

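The code above falls back from regex 1 to regex 2 by catching a `NullPointerException`. The same idea can be sketched with explicit control flow, trying each pattern in order and returning the first match. `FallbackMatcher` and the sample patterns are hypothetical and are not the project's `Crawler` templates.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FallbackMatcher {
    /**
     * Tries each pattern in order and returns group(1) of the first one that matches.
     * Every pattern is expected to contain at least one capturing group.
     */
    static Optional<String> firstMatch(String text, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            Matcher m = p.matcher(text);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty(); // no template matched; the caller decides what to do
    }

    public static void main(String[] args) {
        // Two hypothetical templates for the same payload in different page layouts.
        List<Pattern> templates = Arrays.asList(
                Pattern.compile("window\\.stats\\s*=\\s*(\\{.*?\\})\\}catch"),
                Pattern.compile("\"stats\"\\s*:\\s*(\\{.*?\\})"));

        String newLayout = "{\"stats\":{\"a\":2}}";
        System.out.println(firstMatch(newLayout, templates).orElse("no match"));
    }
}
```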
How the data is transformed at each stage:

```Java
[Jsoup.connect(url).get()]Document(page)-->
[m.group()]String(result)-->
[JSON.parseArray(string)]JSONArray(jsonArray)-->
[JSON.toJavaObject((JSON) jsonObj,Pojo.class)]JavaBean(pojo)
```

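The last two stages of this flow (JSON string to array to JavaBean) rely on a JSON library in the project. To show the shape of that mapping without any dependency, the sketch below swaps in a plain regex scan instead. It is a deliberately limited stand-in that handles only flat objects with string or integer values (no nesting, no escaped quotes), and `ProvinceStat`, `FlatJsonMapper`, and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical POJO; the field names are assumptions, not the project's AreaStat. */
final class ProvinceStat {
    String provinceName;
    int confirmedCount;
}

final class FlatJsonMapper {
    // One flat object: everything between a '{' and the next '}' with no braces inside.
    private static final Pattern OBJECT = Pattern.compile("\\{([^{}]*)\\}");
    // One field: "key":"string" (group 2) or "key":integer (group 3).
    private static final Pattern FIELD =
            Pattern.compile("\"(\\w+)\"\\s*:\\s*(?:\"([^\"]*)\"|(-?\\d+))");

    /** Maps a JSON array of flat objects to POJOs; unknown keys are ignored. */
    static List<ProvinceStat> parse(String jsonArray) {
        List<ProvinceStat> result = new ArrayList<>();
        Matcher object = OBJECT.matcher(jsonArray);
        while (object.find()) {
            ProvinceStat stat = new ProvinceStat();
            Matcher field = FIELD.matcher(object.group(1));
            while (field.find()) {
                String key = field.group(1);
                if (key.equals("provinceName") && field.group(2) != null) {
                    stat.provinceName = field.group(2);
                } else if (key.equals("confirmedCount") && field.group(3) != null) {
                    stat.confirmedCount = Integer.parseInt(field.group(3));
                }
            }
            result.add(stat);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        String json = "[{\"provinceName\":\"Hubei\",\"confirmedCount\":100},"
                + "{\"provinceName\":\"Zhejiang\",\"confirmedCount\":20}]";
        for (ProvinceStat s : parse(json)) {
            System.out.println(s.provinceName + ": " + s.confirmedCount);
        }
    }
}
```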
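**jsoup is a Java HTML parser that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods:** [Detailed Jsoup reference][3]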

---

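## Project download

Project repository: [nCoV_Crawler2019][4]
Note: this project is for learning purposes only; for ongoing updates, follow the reference project.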
![](https://s2.ax1x.com/2020/02/05/1rB5Mq.png)
![](https://s2.ax1x.com/2020/02/05/1rBfRs.png)
![](https://s2.ax1x.com/2020/02/05/1rBhzn.png)
**Note:** `Navicat can export tables to Excel`

[1]: https://ncov.dxy.cn/ncovh5/view/pneumonia_peopleapp
[2]: https://gitee.com/TicsmycL/nCoV_Crawler2019
[3]: https://www.cnblogs.com/zhangyinhua/p/8037599.html
[4]: https://github.com/TangSong99/nCoV_Crawler2019

Last modified: August 10, 2022
