A five-minute introduction to jsoup

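jsoup is a Java HTML parser, used mainly to parse HTML.

When writing a crawler, we fetch a page's source with a framework such as HttpClient and then need to pull the content we want out of that source. An HTML parser like jsoup makes this easy.

jsoup can also fetch a page's source directly from an address, but it only supports the HTTP and HTTPS protocols, so its fetching support is limited. Its main use is therefore parsing HTML.

The HTML to parse can come from a string, a URL, or a file.

org.jsoup.Jsoup converts the input HTML into an org.jsoup.nodes.Document object, from which we then extract the elements we want.

org.jsoup.nodes.Document extends org.jsoup.nodes.Element, which in turn extends org.jsoup.nodes.Node. Together they provide a rich set of methods for accessing HTML elements.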

Fetching HTML from a URL and parsing it

Document doc = Jsoup.connect("http://www.baidu.com/").get();
String title = doc.title();

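Jsoup.connect("xxx") returns an org.jsoup.Connection object.
On the Connection we can call get() or post() to execute the request. Before executing it,
we can use the Connection to set request details such as headers, cookies, the timeout, and a proxy, to mimic a browser.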

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

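Once we have the Document object, the next step is to parse it and extract the elements we want.

Document provides a rich set of methods for finding specific elements.

Finding elements the DOM way

  getElementById(String id): find an element by its id
  getElementsByTag(String tagName): find elements by tag name
  getElementsByClass(String className): find elements by class name
  getElementsByAttribute(String key): find elements that have the given attribute
  getElementsByAttributeValue(String key, String value): find elements whose attribute has the given value
  getAllElements(): get all elements

Finding elements with CSS- or jQuery-like selectors

This uses the following method of the Element class: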

public Elements select(String cssQuery)

Pass in a CSS- or jQuery-like selector string to find the matching elements.

Example:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a elements that have an href attribute
Elements pngs = doc.select("img[src$=.png]");
  // img elements whose src ends with .png

Element masthead = doc.select("div.masthead").first();
  // the first div with class masthead

Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of an h3 with class r

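More selector syntax (see org.jsoup.select.Selector for the full reference):

  tagname: find elements by tag name, e.g. a
  ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
  #id: find elements by id, e.g. #logo
  .class: find elements by class name, e.g. .masthead
  [attribute]: find elements that have the attribute, e.g. [href]
  [^attr]: find elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
  [attr=value]: find elements by attribute value, e.g. [width=500]
  [attr^=value], [attr$=value], [attr*=value]: find elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
  [attr~=regex]: find elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
  *: matches all elements

Combining selectors
  el#id: element plus id, e.g. div#logo
  el.class: element plus class, e.g. div.masthead
  el[attr]: element plus attribute, e.g. a[href]
  Any combination, e.g. a[href].highlight
  ancestor child: find child elements that descend from an ancestor, e.g. body p finds every p element anywhere under the body element
  parent > child: find direct children of a parent, e.g. div.content > p finds p elements, and body > * finds all direct children of the body tag
  siblingA + siblingB: find the sibling B element that immediately follows A, e.g. div.head + div
  siblingA ~ siblingX: find sibling X elements that come after A, e.g. h1 ~ p
  el, el, el: group several selectors and find the unique elements that match any of them, e.g. div.masthead, div.logo

Pseudo selectors
  :lt(n): find elements whose sibling index (their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3) matches the first three cells of each row
  :gt(n): find elements whose sibling index is greater than n, e.g. div p:gt(2) matches p elements inside a div whose sibling index is greater than 2
  :eq(n): find elements whose sibling index equals n, e.g. form input:eq(1) matches an input inside a form whose sibling index is 1
  :has(selector): find elements that contain an element matching the selector, e.g. div:has(p) matches divs that contain a p element
  :not(selector): find elements that do not match the selector, e.g. div:not(.logo) matches all divs that do not have class="logo"
  :contains(text): find elements that contain the given text; the search is case-insensitive, e.g. p:contains(jsoup)
  :containsOwn(text): find elements that directly contain the given text
  :matches(regex): find elements whose text matches the given regular expression, e.g. div:matches((?i)login)
  :matchesOwn(regex): find elements whose own text matches the given regular expression
Note: the indexes used by the pseudo selectors above are zero-based, so the first element has index 0, the second has index 1, and so on.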
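
A regular expression like the one in the [attr~=regex] example can be tried out on its own with java.util.regex before it goes into a selector. The sketch below uses only the JDK, with the dot escaped so that it matches a literal "." in the file name. It assumes the selector does a partial (find-style) match against the attribute value rather than a full-string match; the class name, method name, and sample paths are made up for illustration.

```java
import java.util.regex.Pattern;

class AttrRegexDemo {
    // The pattern from the img[src~=...] example; the backslash is doubled in Java source.
    private static final Pattern IMAGE_EXT = Pattern.compile("(?i)\\.(png|jpe?g)");

    // Assumption: a partial match (find) is enough, as opposed to matching the whole value.
    public static boolean looksLikeImage(String src) {
        return IMAGE_EXT.matcher(src).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeImage("/img/logo.PNG"));   // true, (?i) ignores case
        System.out.println(looksLikeImage("/img/photo.jpeg")); // true
        System.out.println(looksLikeImage("/img/icon.gif"));   // false
    }
}
```

If the dot were left unescaped, it would match any character, so a value such as "/img/xpng" would also pass.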

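The selectors above return an Elements object. It extends ArrayList and holds nothing but Element objects.

What remains is to pull the content we actually need out of each Element.

The usual methods are:

Element.text()

Returns the text inside an element.

Element.html() or Node.outerHtml()

Returns the HTML of an element: html() gives the inner HTML, and outerHtml() also includes the element's own tag.

Node.attr(String key)

Returns the value of an attribute, for example the href value of a hyperlink <a href="">.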
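
The extraction methods are easiest to compare side by side on a small HTML string, which also shows the string input form mentioned earlier. This is a minimal sketch that needs jsoup on the classpath; the HTML snippet, the base URI http://example.com/, and the class and method names are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ExtractDemo {
    // Parses an HTML string; the base URI is used later to resolve relative links.
    public static Element firstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("a[href]").first();
    }

    public static void main(String[] args) {
        String html = "<div id=\"box\"><a href=\"/docs\">Read <b>more</b></a></div>";
        Element link = firstLink(html, "http://example.com/");

        System.out.println(link.text());         // text only, tags removed
        System.out.println(link.html());         // inner HTML, keeps the <b> tag
        System.out.println(link.outerHtml());    // the <a> tag itself plus its content
        System.out.println(link.attr("href"));   // raw attribute value
        System.out.println(link.absUrl("href")); // attribute value resolved against the base URI
    }
}
```

Note that first() returns null when nothing matches, so real code should check the result before using it.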


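This article is aimed at readers with basic Java knowledge.

Author: HelloGitHub-秦人

This installment of HelloGitHub's 《講解開源項目》 series introduces jsoup, an open-source Java framework for parsing web page elements, which lets a program collect web page data automatically.

Project source: https://github.com/jhy/jsoup

1. Project introduction

jsoup is a Java HTML parser. It can parse the HTML of a URL directly, and it provides a convenient API for extracting and manipulating data using DOM methods, CSS, and jQuery-like selectors.

Main features of jsoup:

Parse HTML from a URL, a file, or a string.
Find and extract data using DOM traversal or CSS selectors.
Manipulate HTML elements, attributes, and text.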

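2. Using the framework

2.1 Prerequisites

Knowledge of HTML syntax
Debugging techniques in the Chrome browser
Basic use of the IDEA development tool

2.2 Studying the source code

Import the project into IDEA, which automatically downloads the dependencies the Maven project needs.

Getting up to speed on source code quickly is a skill every programmer needs. My approach comes down to a few points:

Read the project's README to learn quickly what the project does.
Skim the project's pom.xml to see which dependencies it uses.
Look at the project structure, the source directories, and the test directories; a good project has a clear, well-layered structure.
Run the test cases to try the project out quickly.

2.3 Downloading the project

git clone https://github.com/jhy/jsoup

2.4 Running the project's example code

Following the approach above, we quickly find that the example directory holds the example code, so let's run it directly. Note: some of it needs small changes before it will run.

For example, jsoup's Wikipedia example: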

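import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Wikipedia {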
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
        log(doc.title());

        Elements newsHeadlines = doc.select("#mp-itn b a");
        for (Element headline : newsHeadlines) {
            log("%s\n\t%s", headline.attr("title"), headline.absUrl("href"));
        }
    }

    private static void log(String msg, String... vals) {
        System.out.println(String.format(msg, vals));
    }
}

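Explanation: the code above fetches the page (http://en.wikipedia.org/), selects every element that matches the selector (#mp-itn b a), and prints each element's title and href attributes. Wikipedia cannot be reached from within mainland China, so this code fails when run there.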
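
The example prints headline.absUrl("href") rather than the raw href because links in a page are often relative. What "absolute" means here can be illustrated with the JDK alone: the sketch below resolves an href against the page address using java.net.URI. It is not jsoup's implementation, only an illustration of the idea, and the paths and the static.example.org host are hypothetical.

```java
import java.net.URI;

class AbsUrlSketch {
    // Resolves an href against the page's address using the JDK only (not jsoup).
    public static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://en.wikipedia.org/wiki/Main_Page";
        // Root-relative path: keeps scheme and host, replaces the path.
        System.out.println(resolve(base, "/wiki/Java"));
        // Relative path: replaces the last segment of the base path.
        System.out.println(resolve(base, "Jsoup"));
        // Scheme-relative address: keeps only the scheme of the base.
        System.out.println(resolve(base, "//static.example.org/logo.png"));
        // Already absolute: returned unchanged.
        System.out.println(resolve(base, "https://other.example.org/x"));
    }
}
```

URI.create throws IllegalArgumentException for strings that are not valid URIs (for example, ones containing spaces), so hrefs scraped from real pages may need cleaning first.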

A modified version that runs:

public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("https://www.baidu.com/").get();
    Elements newsHeadlines = doc.select("a[href]");
    for (Element headline : newsHeadlines) {
        System.out.println("href: " +headline.absUrl("href") );
    }
}

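3. How it works

jsoup works as follows: you specify a URL, the framework sends an HTTP request and receives the response page, and you then extract data from the page with selectors.

Taking the code above as an example:

3.1 Sending the request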

Document doc = Jsoup.connect("https://www.baidu.com/").get();

This line sends the HTTP request and receives the page's response data.

3.2 Filtering the data

Elements newsHeadlines = doc.select("a[href]");

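Define a selector and get the data that matches it.

3.3 Processing the data

for (Element headline : newsHeadlines) {
    System.out.println("href: " + headline.absUrl("href"));
}

Here the data is simply printed, but it could just as well be written to a file or a database.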
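
As one way to go from printing to saving, the links could be collected into a list and written to a text file, one per line. This is a sketch using only the JDK; the class name, the links.txt file name, and the sample links are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class LinkFileStore {
    // Writes one link per line as UTF-8, replacing the file if it already exists.
    public static void writeLinks(Path target, List<String> links) {
        try {
            Files.write(target, links, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the links back, one per line.
    public static List<String> readLinks(Path source) {
        try {
            return Files.readAllLines(source, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in data; in the loop above each value would come from headline.absUrl("href").
        List<String> links = Arrays.asList("https://www.baidu.com/", "https://www.baidu.com/more/");
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "links.txt");
        writeLinks(target, links);
        System.out.println(readLinks(target).size() + " links written to " + target);
    }
}
```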

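4. In practice

We collect basic information for each new book under Douban Books -> New Book Express (新書速遞): title, cover image link, author, content summary (detail page), author biography (detail page), and the book's price on Dangdang (detail page), and finally save the data to an Excel file.

Target link: https://book.douban.com/latest?icn=index-latestbook-all

4.1 The project's pom.xml

The project uses three libraries: jsoup, lombok, and easyexcel.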

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>JsoupTest</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.target>1.8</maven.compiler.target>
        <maven.compiler.source>1.8</maven.compiler.source>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.13.1</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.12</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>easyexcel</artifactId>
            <version>2.2.6</version>
        </dependency>
    </dependencies>
</project>

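4.2 Parsing the page data

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// BookEntity and BookDetailInfo are the project's own classes; they are not shown in this article.
public class BookInfoUtils {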

    public static List<BookEntity> getBookInfoList(String url) throws IOException {
        List<BookEntity>  bookEntities=new ArrayList<>();
        Document doc = Jsoup.connect(url).get();
        Elements liDiv = doc.select("#content > div > div.article > ul > li");
        for (Element li : liDiv) {
            Elements urls = li.select("a[href]");
            Elements imgUrl = li.select("a > img");
            Elements bookName = li.select(" div > h2 > a");
            Elements starsCount = li.select(" div > p.rating > span.font-small.color-lightgray");
            Elements author = li.select("div > p.color-gray");
            Elements description = li.select(" div > p.detail");

            String bookDetailUrl = urls.get(0).attr("href");
            BookDetailInfo detailInfo = getDetailInfo(bookDetailUrl);
            BookEntity bookEntity = BookEntity.builder()
                    .detailPageUrl(bookDetailUrl)
                    .bookImgUrl(imgUrl.attr("src"))
                    .bookName(bookName.html())
                    .starsCount(starsCount.html())
                    .author(author.text())
                    .bookDetailInfo(detailInfo)
                    .description(description.html())
                    .build();
//            System.out.println(bookEntity);
            bookEntities.add(bookEntity);
        }
        return bookEntities;
    }
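    /**
     * Fetches a book's detail page and extracts the author and price.
     *
     * @param detailUrl URL of the book's detail page
     * @return the author name, author link, and price found on the page
     * @throws IOException if the page cannot be fetched
     */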
    public static BookDetailInfo getDetailInfo(String detailUrl)throws IOException{

        Document doc = Jsoup.connect(detailUrl).get();
        Elements content = doc.select("body");

        Elements price = content.select("#buyinfo-printed > ul.bs.current-version-list > li:nth-child(2) > div.cell.price-btn-wrapper > div.cell.impression_track_mod_buyinfo > div.cell.price-wrapper > a > span");
        Elements author = content.select("#info > span:nth-child(1) > a");
        BookDetailInfo bookDetailInfo = BookDetailInfo.builder()
                .author(author.html())
                .authorUrl(author.attr("href"))
                .price(price.html())
                .build();
        return bookDetailInfo;
    }
}

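The key here is finding the selector for each element on the page.

For example, how do we know the div > p.color-gray in li.select("div > p.color-gray")?

Chrome users have probably guessed. Open Chrome's developer tools, press Ctrl + Shift + C and pick an element, then right-click it in the HTML panel and choose Copy -> Copy selector to get the selector for that element.

4.3 Saving the data to Excel

To make the data easier to browse, I save the data scraped with jsoup to an Excel file, using easyexcel to generate the file quickly.

Excel header definitions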

@Data
@Builder
public class ColumnData {

    @ExcelProperty("Title")
    private String bookName;

    @ExcelProperty("Rating")
    private String starsCount;

    @ExcelProperty("Author")
    private String author;

    @ExcelProperty("Cover image")
    private String bookImgUrl;

    @ExcelProperty("Summary")
    private String description;

    @ExcelProperty("Price")
    private String price;
}

Generating the Excel file

public class EasyExcelUtils {

    public static void simpleWrite(List<BookEntity> bookEntityList) {
        String fileName = "D:\\devEnv\\JsoupTest\\bookList" + System.currentTimeMillis() + ".xlsx";
        EasyExcel.write(fileName, ColumnData.class).sheet("Book details").doWrite(data(bookEntityList));
        System.out.println("Excel file generated.");
    }
    private static List<ColumnData> data(List<BookEntity> bookEntityList) {
        List<ColumnData> list = new ArrayList<>();
        bookEntityList.forEach(b -> {
            ColumnData data = ColumnData.builder()
                    .bookName(b.getBookName())
                    .starsCount(b.getStarsCount())
                    .author(b.getBookDetailInfo().getAuthor())
                    .bookImgUrl(b.getBookImgUrl())
                    .description(b.getDescription())
                    .price(b.getBookDetailInfo().getPrice())
                    .build();
            list.add(data);
        });
        return list;
    }
}

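4.4 Final result

The generated Excel file has one row per book, with the columns defined in ColumnData.

That takes us from idea to practice: the example above exercises jsoup's basic operations on a real site.

Full code: https://github.com/hellowHuaairen/JsoupTest

5. Closing thoughts

jsoup is a Java HTML parser library, yet it is quite convenient as a simple crawler, isn't it?

Why talk about crawlers? The era of big data and artificial intelligence runs on data, and data matters. As people with some technical skill, we should also know a way to collect data from the web. There are also tools, such as Fiddler and webscraper, that can capture the data you want.

By this point you should have a feel for jsoup. Isn't programming fun? With the practical example above as a reference, there are plenty of sites to practice on. Feel free to share your results in the comments.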

Add the dependencies

        <!--SpringMVC-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!--SpringData Jpa-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <!-- MySQL connector -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.49</version>
        </dependency>
        <!-- HttpClient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
        </dependency>
        <!--Jsoup-->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
             <version>1.15.2</version>
        </dependency>
         <!--lombok-->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>

Configure application.properties

# MySQL configuration
spring.datasource.driverClassName=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/demo?useUnicode=true&characterEncoding=utf8
spring.datasource.username=root
spring.datasource.password=123456


# JPA configuration
spring.jpa.database=MySQL
spring.jpa.show-sql=true
spring.jpa.generate-ddl=true
spring.jpa.hibernate.ddl-auto=update
spring.jpa.hibernate.naming_strategy=org.hibernate.cfg.ImprovedNamingStrategy


POJO

@Entity
@Table(name = "item")
@Data
public class Item {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // SPU (standard product unit)
    private Long spu;
    // SKU (stock keeping unit)
    private Long sku;
    // product title
    private String title;
    // product price
    private Double price;
    // product image
    private String pic;
    // product detail page URL
    private String url;
    // shop
    private String shop;
    // creation time
    private Date created;
    // update time
    private Date updated;
}

Dao

public interface ItemDao extends JpaRepository<Item,Long> {
}

Service

public interface ItemService {

    /**
     * Saves an item.
     *
     * @param item the item to save
     */
    void save(Item item);

    /**
     * Deletes all items.
     */
    void deleteAll();
}


@Service
public class ItemServiceImpl implements ItemService {

    @Autowired
    private ItemDao itemDao;

    @Override
    @Transactional
    public void save(Item item) {
        this.itemDao.save(item);
    }

    @Override
    public void deleteAll() {
        this.itemDao.deleteAll();
    }
}

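Wrapping HttpClient

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;

@Component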
public class HttpUtils {

    private static final String FILEPATH = "D:\\demo\\";

    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        // maximum total connections
        this.cm.setMaxTotal(100);
        // maximum connections per host (route)
        this.cm.setDefaultMaxPerRoute(10);
    }

    /**
     * Downloads the page at the given address.
     *
     * @param url the page address
     * @return the page content, or an empty string on failure
     */
    public String doGetHtml(String url) {
        // get an HttpClient backed by the shared connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        // create the GET request for the given URL
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36");
        // apply the request settings
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // execute the request and get the response
            response = httpClient.execute(httpGet);
            // handle the response
            if (response.getStatusLine().getStatusCode() == 200) {
                // EntityUtils can only be used when the response has a body
                if (response.getEntity() != null) {
                    String content = EntityUtils.toString(response.getEntity(), "utf8");
                    return content;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // close the response
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        // return an empty string on failure
        return "";
    }


    /**
     * Downloads an image.
     *
     * @param url the image address
     * @return the name of the saved image file, or an empty string on failure
     */
    public String doGetImage(String url) {
        // get an HttpClient backed by the shared connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        // create the GET request for the given URL
        HttpGet httpGet = new HttpGet(url);
        // apply the request settings
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // execute the request and get the response
            response = httpClient.execute(httpGet);
            // handle the response
            if (response.getStatusLine().getStatusCode() == 200) {
                // make sure the response has a body
                if (response.getEntity() != null) {
                    // take the file extension from the URL
                    String extName = url.substring(url.lastIndexOf("."));
                    // generate a new, unique file name
                    String picName = UUID.randomUUID() + extName;
                    // write the image to disk; try-with-resources closes the stream
                    try (OutputStream outputStream = new FileOutputStream(new File(FILEPATH + picName))) {
                        response.getEntity().writeTo(outputStream);
                    }
                    // return the image name
                    return picName;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // close the response
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        // return an empty string if the download failed
        return "";
    }

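    /**
     * Builds the request settings.
     *
     * @return the timeouts applied to every request
     */
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                // maximum time to establish a connection, in milliseconds
                .setConnectTimeout(1000)
                // maximum time to obtain a connection from the pool, in milliseconds
                .setConnectionRequestTimeout(500)
                // maximum time for data transfer, in milliseconds
                .setSocketTimeout(10000)
                .build();

        return config;
    }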
}
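
doGetImage takes the file extension with url.substring(url.lastIndexOf(".")), which throws when the address contains no dot and picks up the query string when there is one. A more defensive helper might look like the sketch below. The class name, method name, fallback value, and the img.example.com addresses are all hypothetical, and the helper assumes the address has a path after the host.

```java
class ImageNames {
    // Returns the extension (with its leading dot) of the last path segment,
    // or the fallback when no extension can be found.
    public static String extensionOf(String url, String fallback) {
        // Drop any query string or fragment before looking for the extension.
        int cut = url.length();
        int query = url.indexOf('?');
        int fragment = url.indexOf('#');
        if (query >= 0) cut = Math.min(cut, query);
        if (fragment >= 0) cut = Math.min(cut, fragment);
        String path = url.substring(0, cut);

        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // A dot before the last slash is not part of the file name; a trailing dot has no extension.
        if (dot <= slash || dot == path.length() - 1) {
            return fallback;
        }
        return path.substring(dot);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("https://img.example.com/n1/abc.jpg", ".jpg"));       // .jpg
        System.out.println(extensionOf("https://img.example.com/n1/abc.png?v=1.5", ".jpg")); // .png
        System.out.println(extensionOf("https:", ".jpg"));                                   // .jpg (fallback)
    }
}
```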

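SPU and SKU

SPU

An SPU (standard product unit) is the smallest unit at which product information is aggregated: a set of reusable, easily searchable, standardized information that describes the characteristics of a product.

Products whose attribute values and characteristics are the same can be called one SPU.

For example, a particular laptop model corresponds to one SPU, even though it comes in several configurations or colors.

SKU

An SKU (stock keeping unit) is the unit used to count stock moving in and out, and may be an item, a box, a pallet, and so on. An SKU is the smallest stock unit that cannot be physically divided. How it is used depends on the business type and management model.

For example, a laptop model comes in several configurations, and the 8G+512G variant is one SKU.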

Crawl analysis

We crawl the laptop search results page. Paging through the results gives the paginated request address: https://search.jd.com/search?keyword=%E7%94%B5%E8%84%91&wq=%E7%94%B5%E8%84%91&pvid=56a110735c6c491c91416c194aed4c5b&cid3=672&cid2=671&s=56&click=0&page=
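
The address ends with page=, so a page number is appended to it for each request, and the crawl logic later in this article requests only odd page numbers (1, 3, 5, ...). A small sketch of building that list of addresses follows; the class name, method name, and the shortened search.example.com address are placeholders for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PageUrls {
    // Builds the request addresses for the first `count` result pages: page=1, 3, 5, ...
    public static List<String> oddPageUrls(String baseUrl, int count) {
        List<String> urls = new ArrayList<>();
        for (int n = 0; n < count; n++) {
            urls.add(baseUrl + (2 * n + 1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Placeholder address; the real one is the long search address shown above.
        String base = "https://search.example.com/search?keyword=laptop&page=";
        for (String url : oddPageUrls(base, 5)) {
            System.out.println(url);
        }
    }
}
```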

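All products are wrapped in a div whose id is J_goodsList (the code below selects it with div#J_goodsList). Inside the div is a ul containing li tags, and each li corresponds to one product.

Each li tag carries the product information we need.

Crawl logic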

@Component
public class ItemTask {

    @Autowired
    private HttpUtils httpUtils;
    @Autowired
    private ItemService itemService;

    /**
     * Scheduled task that fetches the latest data.
     *
     * @throws Exception if fetching or saving fails
     */
    @Scheduled(fixedDelay = 50 * 1000)
    public void itemTask() throws Exception {
        // clear the existing data before each run
        itemService.deleteAll();

        // base address to crawl; the page number is appended to it
        String url = "https://search.jd.com/search?keyword=%E7%94%B5%E8%84%91&wq=%E7%94%B5%E8%84%91&pvid=56a110735c6c491c91416c194aed4c5b&cid3=672&cid2=671&s=56&click=0&page=";

        // walk the search result pages; note that the page numbers are odd
        for (int i = 1; i < 10; i = i + 2) {
            String html = httpUtils.doGetHtml(url + i);
            // parse the page, then extract and save the product data
            this.parse(html);
        }
        System.out.println("Product data crawl finished!");
    }

    /**
     * Parses a page, then extracts and saves the product data.
     *
     * @param html the HTML source of a search result page
     */
    private void parse(String html) {
        // parse the HTML into a Document
        Document doc = Jsoup.parse(html);
        // select the product list items
        Elements spuEles = doc.select("div#J_goodsList > ul > li");

        // loop over the products in the list
        for (int i = 0; i < spuEles.size(); i++) {
            Element element = spuEles.get(i);
            // read the SPU; attr() returns an empty string when the attribute is missing
            String strSpu = element.attr("data-spu");
            if (strSpu.isEmpty()) {
                continue;
            }
            long spu = Long.parseLong(strSpu);
            // read the SKU
            long sku = Long.parseLong(element.attr("data-sku"));

            Item item = new Item();
            // set the product's SPU
            item.setSpu(spu);
            // set the product's SKU
            item.setSku(sku);
            // build the product detail page URL
            String itemUrl = "https://item.jd.com/" + sku + ".html";
            item.setUrl(itemUrl);

            // download the product image
            String picUrl = "https:" + element.select("div.p-img").select("a").select("img").attr("data-lazy-img");
            String picName = this.httpUtils.doGetImage(picUrl);
            item.setPic(picName);

            // read the product price
            String strPrice = element.select("div.p-price").select("i").text();
            item.setPrice(Double.parseDouble(strPrice));

            // read the product title
            String title = element.select("div.p-name").select("a").attr("title");
            item.setTitle(title);

            // read the shop name
            String shopName = element.select("div.p-shop a").text();
            item.setShop(shopName);

            item.setCreated(new Date());
            item.setUpdated(item.getCreated());

            // save the product to the database
            this.itemService.save(item);
        }
    }
}
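
One fragile spot in parse is Double.parseDouble(strPrice): if the price element is missing or holds non-numeric text, the call throws NumberFormatException and the whole scheduled run stops. A tolerant parser could be sketched as below, using only the JDK; the class and method names are hypothetical.

```java
import java.util.OptionalDouble;

class PriceParser {
    // Parses text such as "5999.00"; returns empty when the text is missing, blank, or not a number.
    public static OptionalDouble parsePrice(String text) {
        if (text == null) {
            return OptionalDouble.empty();
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(trimmed));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePrice("5999.00").isPresent()); // true
        System.out.println(parsePrice("").isPresent());        // false
        System.out.println(parsePrice("N/A").isPresent());     // false
    }
}
```

In parse, the result could then be checked with isPresent() before calling item.setPrice(...), so that a product without a readable price is saved without one or skipped instead of aborting the task.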

Configure the application class

@SpringBootApplication
// enable scheduled tasks
@EnableScheduling
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}

Run and test

Start the project and let the task run, then check the database and the images downloaded to the local disk.
