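Word-to-HTML Mini Tool

Why build this tool?

Some editors are awkward for layout work but handle HTML code well.

Can't Word already export to HTML? What is different here?

Word's built-in HTML export is not very good: it embeds a lot of font and color formatting that third-party editors generally do not support.
This tool emits plain HTML only, such as h1, em, and p elements, so in principle its output should work with most third-party editors that accept HTML.

Features

Packaged with PyQt5, so the program is fairly large, a little over 30 MB.
Built on the mammoth module, which provides the core doc/docx-to-HTML conversion.

Interface

How to get it: send a private message containing "word".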

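Recently I had a requirement where the front end uploads a manuscript in Word format. After uploading, the user can view the manuscript directly in the browser and quote it directly in a rich-text editor. What the back end saves to the database therefore has to be HTML, which means converting Word to HTML.

After some investigation, there are two ways to handle the Word-to-HTML step. Option 1 is to convert on the front end. Option 2 is to upload the Word document to the back end, which converts it and returns the result to the front end. The feedback I found on option 1 reported many problems, so I did not adopt it. The final decision was for the back end to convert to HTML and return it to the front end for preview; once the user has previewed it and confirmed the formatting is fine, the HTML is saved on the back end. (Word documents contain many kinds of complex elements, such as images, Visio diagrams, and tables, and converting them to HTML can cause formatting problems, so a preview step is needed.)

Plain text in Word is not much of a problem; the main work is handling non-text elements such as images, video, and tables. In this article only images are handled: the back end extracts the images from the Word document (the imported jars already provide this), uploads them to the server, and puts the resulting absolute URLs into the HTML, so the HTML returned to the front end can be previewed directly.

The Maven dependencies are as follows: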

        <poi-scratchpad.version>3.14</poi-scratchpad.version>
        <poi-ooxml.version>3.14</poi-ooxml.version>
        <xdocreport.version>1.0.6</xdocreport.version>
        <poi-ooxml-schemas.version>3.14</poi-ooxml-schemas.version>
        <ooxml-schemas.version>1.3</ooxml-schemas.version>
        <jsoup.version>1.11.3</jsoup.version>

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>${poi-scratchpad.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>${poi-ooxml.version}</version>
        </dependency>
        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>xdocreport</artifactId>
            <version>${xdocreport.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>${poi-ooxml-schemas.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>ooxml-schemas</artifactId>
            <version>${ooxml-schemas.version}</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>${jsoup.version}</version>
        </dependency>

Word-to-HTML conversion works differently for Word 2003 and Word 2007 because the two file formats differ. The utility class is as follows:

Usage is as follows:

public String uploadSourceNews(MultipartFile file) {
        String fileName = file.getOriginalFilename();
        // Guard against a missing name or a name without an extension.
        int dot = (fileName == null) ? -1 : fileName.lastIndexOf('.');
        String suffixName = (dot < 0) ? "" : fileName.substring(dot);
        if (!".doc".equals(suffixName) && !".docx".equals(suffixName)) {
            throw new UploadFileFormatException();
        }
        // Per-month directory (yyyyMM) passed to the converter along with the image bucket.
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyyMM");
        String dateDir = formatter.format(LocalDate.now());
        String directory = imageDir + "/" + dateDir + "/";
        String content = null;
        try (InputStream inputStream = file.getInputStream()) {
            // suffixName includes the leading dot, so compare against ".doc";
            // comparing against "doc" would send every file down the 2007 path.
            if (".doc".equals(suffixName)) {
                content = wordToHtmlUtil.Word2003ToHtml(inputStream, imageBucket, directory, Constants.HTTPS_PREFIX + imageVisitHost);
            } else {
                content = wordToHtmlUtil.Word2007ToHtml(inputStream, imageBucket, directory, Constants.HTTPS_PREFIX + imageVisitHost);
            }
        } catch (Exception ex) {
            logger.error("word to html exception, detail:", ex);
            return null;
        }
        return content;
    }
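
The method above chooses the converter from the file extension alone, so a renamed file would be routed to the wrong converter. A content check can supplement it. The sketch below is not from the original article: the class name `WordFormatSniffer` and its enum are hypothetical, and it relies only on the well-known container signatures (legacy .doc files are OLE2 compound files, .docx files are ZIP packages). It identifies the container only; any other ZIP-based file, such as .xlsx, would also report `DOCX_ZIP`.

```java
import java.util.Arrays;

/** Hypothetical helper: sniff the container format of an uploaded Word file from its first bytes. */
final class WordFormatSniffer {
    enum Kind { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // OLE2 compound-file signature used by legacy .doc files.
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local-file-header signature ("PK" 03 04) used by .docx packages.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies the leading bytes of a file; pass at least the first 8 bytes. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2)) return Kind.DOC_OLE2;
        if (startsWith(head, ZIP)) return Kind.DOCX_ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

In the upload method, one could read the first eight bytes of the stream, call `sniff`, and reject the upload when the result disagrees with the extension.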

A note on how doc and docx files are stored:

DOCX is an XML-based word-processing format developed by Microsoft. DOCX files differ from DOC files in that they store their data as separate compressed files and folders. Versions of Microsoft Office earlier than Office 2007 do not support DOCX, because DOCX is XML-based whereas those earlier versions save a DOC file as a single binary file.

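You may ask: the file clearly ends in .docx, so how is it XML?

It is simple: pick any docx file, open it with an archive tool, and you will see a directory structure like this:

So what looks like a single complete document is really just a compressed archive.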
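
The same inspection can be done in code with the JDK's ZIP classes. The sketch below is an illustration rather than part of the original article: `DocxEntryLister` is a hypothetical name, and `fakeDocx` builds a tiny stand-in package in memory so the example needs no real file. A real .docx typically contains entries such as `[Content_Types].xml` and `word/document.xml`, alongside others.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: list the entries inside a ZIP-based package such as a .docx file. */
final class DocxEntryLister {
    /** Returns the entry names in archive order. */
    static List<String> listEntries(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a two-entry stand-in archive in memory (not a valid Word document). */
    static byte[] fakeDocx() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[]{"[Content_Types].xml", "word/document.xml"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<xml/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

For a real file, pass `Files.newInputStream(Paths.get("some.docx"))` to `listEntries` instead of the in-memory bytes.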

Reference:

https://www.cnblogs.com/ct-csu/p/8178932.html

I. Preface

The previous article, "Online Document Preview, New Edition (1): Implementing Online Preview by Converting Files to Images", covered converting documents to images. Online preview can also be implemented by converting to PDF and rendering on the front end with plugins such as pdf.js or pdfobject.js, or, as this article describes, by converting the document to HTML.

The code below shows how to convert Word, PDF, and Excel files to HTML using aspose, poi + pdfbox, and spire in turn.

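1. aspose

Aspose is a vendor of components for .NET, Java, SharePoint, JasperReports, and SSRS. Thousands of organizations in dozens of countries have used Aspose components to create, edit, convert, or render Office, OpenOffice, PDF, image, ZIP, CAD, XPS, EPS, PSD, and other file formats. Note that Aspose is a commercial product: files exported without a license are covered in watermarks (respect copyright and stay away from cracked versions).

Add the following dependencies to the project's pom file: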

        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-words</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-pdf</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-cells</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-slides</artifactId>
            <version>23.1</version>
        </dependency>

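2. poi + pdfbox

aspose and spire are easy to use, but both are commercial components, so an approach based on open-source libraries is provided as well.

POI is a free, open-source, cross-platform Java API from the Apache Software Foundation. Apache POI gives Java programs the ability to read and write Microsoft Office file formats.

Apache PDFBox is an open-source Java library for working with PDF documents. With it you can write Java programs that create, convert, and manipulate PDFs.

Add the following dependencies to the project's pom file:

        <dependency>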
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-excelant</artifactId>
            <version>5.2.0</version>
        </dependency>

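3. spire

spire is a professional Office programming component that covers reading, writing, editing, and viewing Word, Excel, PPT, PDF, and other files. A free version is available, but it can export only the first 3 pages or the first 500 rows, and reaching either limit triggers the restriction. Going beyond those limits requires the paid version (respect copyright and stay away from cracked versions). The free version is used for the demos here.

Before adding the spire dependency, the Maven repository it is published in has to be added first:

        <repository>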
            <id>com.e-iceblue</id>
            <name>e-iceblue</name>
            <url>https://repo.e-iceblue.cn/repository/maven-public/</url>
        </repository>

Then add the following dependency to the project's pom file.

Free version:

        <dependency>
            <groupId>e-iceblue</groupId>
            <artifactId>spire.office.free</artifactId>
            <version>5.3.1</version>
        </dependency>

Paid version:

        <dependency>
            <groupId>e-iceblue</groupId>
            <artifactId>spire.office</artifactId>
            <version>5.3.1</version>
        </dependency>

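II. Converting files to HTML strings

1. Converting a Word file to an HTML string

1.1 Using aspose

public static String wordToHtmlStr(String wordPath) {
        // The no-argument toString() is not an HTML export, so this version saves as HTML
        // into memory, mirroring the file-based aspose example later in the article.
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            Document doc = new Document(wordPath); // wordPath is the Word document to convert
            HtmlSaveOptions options = new HtmlSaveOptions();
            options.setExportImagesAsBase64(true); // keep images inside the single HTML string
            doc.save(out, options);
            return out.toString("UTF-8");
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }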

Verification result:

1.2 Using poi

public String wordToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
        String htmlStr = null;
        String ext = wordPath.substring(wordPath.lastIndexOf("."));
        if (ext.equals(".docx")) {
            htmlStr = word2007ToHtmlStr(wordPath);
        } else if (ext.equals(".doc")){
            htmlStr = word2003ToHtmlStr(wordPath);
        } else {
            throw new RuntimeException("Incorrect file format");
        }
        return htmlStr;
    }

    public String word2007ToHtmlStr(String wordPath) throws IOException {
        // Convert into an in-memory output stream
        try(ByteArrayOutputStream out = new ByteArrayOutputStream()){
            word2007ToHtmlOutputStream(wordPath, out);
            return out.toString();
        }
    }

    private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
        // A negative ratio disables POI's zip-bomb check; only do this for trusted input
        ZipSecureFile.setMinInflateRatio(-1.0d);
        // try-with-resources closes the input stream and the document
        try (InputStream in = Files.newInputStream(Paths.get(wordPath));
             XWPFDocument document = new XWPFDocument(in)) {
            XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
            // Write the converted XHTML to the supplied output stream
            XHTMLConverter.getInstance().convert(document, out, options);
        }
    }


    private String word2003ToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
        org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
        // Transform document to string
        StringWriter writer = new StringWriter();
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(htmlDocument), new StreamResult(writer));
        return writer.toString();
    }

    private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
        // try-with-resources closes the input stream and the document
        try (InputStream input = Files.newInputStream(Paths.get(wordPath));
             HWPFDocument wordDocument = new HWPFDocument(input)) {
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());
            wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
                // Skip pictures whose type POI cannot identify
                if (PictureType.UNKNOWN.equals(pictureType)) {
                    return null;
                }
                BufferedImage bufferedImage = ImgUtil.toImage(content);
                String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
                // Embed each picture as a base64 data URI so everything stays in one page
                return "data:;base64," + base64Img;
            });
            // Parse the Word document
            wordToHtmlConverter.processDocument(wordDocument);
            return wordToHtmlConverter.getDocument();
        }
    }
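
The picture manager above returns a data URI with an empty media type (`data:;base64,`). Naming the MIME type explicitly leaves less to browser sniffing. The sketch below shows the idea with the JDK only; `DataUriBuilder` and its small extension table are assumptions made for illustration, not code from the original article or from POI.

```java
import java.util.Base64;
import java.util.Locale;

/** Hypothetical helper: build a data URI that names its MIME type explicitly. */
final class DataUriBuilder {
    /** Maps a file extension to a MIME type; the table is illustrative, not exhaustive. */
    static String mimeFor(String extension) {
        switch (extension.toLowerCase(Locale.ROOT)) {
            case "png": return "image/png";
            case "jpg":
            case "jpeg": return "image/jpeg";
            case "gif": return "image/gif";
            case "bmp": return "image/bmp";
            default: return "application/octet-stream";
        }
    }

    /** Encodes the raw image bytes and prefixes them with data:<mime>;base64, */
    static String toDataUri(byte[] content, String extension) {
        String base64 = Base64.getEncoder().encodeToString(content);
        return "data:" + mimeFor(extension) + ";base64," + base64;
    }
}
```

Inside the picture manager, the raw `content` bytes and `pictureType.getExtension()` could be passed to `toDataUri` directly, which would also avoid decoding and re-encoding the image.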

1.3 Using spire

public String wordToHtmlStr(String wordPath) throws IOException {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            Document document = new Document();
            document.loadFromFile(wordPath);
            document.saveToFile(outputStream, FileFormat.Html);
            return outputStream.toString();
        }
    }

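2. Converting a PDF file to an HTML string

2.1 Using aspose

public static String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        // Note: as published, this listing matches section 2.2 and uses PDFBox's PDDocument
        // with PDFDomTree (from the pdf2dom library), not the Aspose.PDF API.
        PDDocument document = PDDocument.load(new File(pdfPath));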
        Writer writer = new StringWriter();
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
        return writer.toString();
    }

Verification result:

2.2 Using poi + pdfbox

public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new StringWriter();
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
        return writer.toString();
    }

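2.3 Using spire

public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            PdfDocument pdf = new PdfDocument();
            pdf.loadFromFile(pdfPath);
            // The original listing never wrote to outputStream, so it returned an empty string.
            // saveToStream is assumed here by analogy with the spire Excel example in section 3.3.
            pdf.saveToStream(outputStream, com.spire.pdf.FileFormat.HTML);
            return outputStream.toString();
        }
    }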

3. Converting an Excel file to an HTML string

3.1 Using aspose

public static String excelToHtmlStr(String excelPath) throws Exception {
        // Note: as published, this listing uses Apache POI (XSSFWorkbook, .xlsx only),
        // not the Aspose.Cells API.
        try (FileInputStream fileInputStream = new FileInputStream(excelPath);
             Workbook workbook = new XSSFWorkbook(fileInputStream)) {
            DataFormatter dataFormatter = new DataFormatter();
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            // Only the first sheet is converted
            Sheet sheet = workbook.getSheetAt(0);
            StringBuilder htmlStringBuilder = new StringBuilder();
            htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
            htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
            htmlStringBuilder.append("</head><body><table>");
            for (Row row : sheet) {
                htmlStringBuilder.append("<tr>");
                for (Cell cell : row) {
                    // DataFormatter evaluates formula cells through the evaluator
                    // and returns the value as displayed text
                    String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                }
                htmlStringBuilder.append("</tr>");
            }
            htmlStringBuilder.append("</table></body></html>");
            return htmlStringBuilder.toString();
        }
    }

Returned HTML string (the two sample data rows repeat throughout the table; the repeats are replaced by an ellipsis here):

<html><head><title>Excel to HTML using Java and POI library</title><style>table, th, td { border: 1px solid black; }</style></head><body><table><tr><td>No.</td><td>Name</td><td>Gender</td><td>Contact</td><td>Address</td></tr><tr><td>1</td><td>Zhang Xiaoling</td><td>Female</td><td>11111111111</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr><tr><td>2</td><td>Wang Xiaoer</td><td>Male</td><td>1222222</td><td>No. xx, Lane xx, xx Road, Pudong New Area, Shanghai</td></tr>...</table></body></html>

3.2 Using poi + pdfbox

public String excelToHtmlStr(String excelPath) throws Exception {
        // WorkbookFactory opens the file itself, so no separate FileInputStream is needed
        try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
            DataFormatter dataFormatter = new DataFormatter();
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
            StringBuilder htmlStringBuilder = new StringBuilder();
            htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
            htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
            htmlStringBuilder.append("</head><body><table>");
            for (Row row : sheet) {
                htmlStringBuilder.append("<tr>");
                for (Cell cell : row) {
                    CellType cellType = cell.getCellType();
                    if (cellType == CellType.FORMULA) {
                        formulaEvaluator.evaluateFormulaCell(cell);
                        cellType = cell.getCachedFormulaResultType();
                    }
                    String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                }
                htmlStringBuilder.append("</tr>");
            }
            htmlStringBuilder.append("</table></body></html>");
            return htmlStringBuilder.toString();
        }
    }
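
One thing the POI listing above does not do is escape cell text before placing it inside `<td>`, so a cell containing `<` or `&` would corrupt the markup. A minimal sketch of an escaping step follows; `HtmlEscaper` is a hypothetical helper written for this illustration and is not part of POI or the original article.

```java
/** Hypothetical helper: escape text before embedding it in HTML element content or attributes. */
final class HtmlEscaper {
    /** Replaces the characters that have special meaning in HTML with entities. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Wraps an escaped value in a table cell. */
    static String cell(String value) {
        return "<td>" + escape(value) + "</td>";
    }
}
```

In the loop above, `htmlStringBuilder.append(HtmlEscaper.cell(cellValue))` would replace the three chained `append` calls.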

3.3 Using spire

public String excelToHtmlStr(String excelPath) throws Exception {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            Workbook workbook = new Workbook();
            workbook.loadFromFile(excelPath);
            workbook.saveToStream(outputStream, com.spire.xls.FileFormat.HTML);
            return outputStream.toString();
        }
    }

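III. Converting files to HTML and writing an HTML file

Sometimes returning an HTML string is not enough and an HTML file has to be produced. What then? A low-effort option is to write the string to the target path with the FileUtils utility class from the org.apache.commons.io package.

Example of writing an HTML string to an HTML file with FileUtils:

First add the dependency to the pom: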

		<dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.8.0</version>
        </dependency>

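Code:

// The input is a .doc file, so the Word converter is called here (the original listing called pdfToHtmlStr)
String htmlStr = FileConvertUtil.wordToHtmlStr("D:\\書籍\\電子書\\小說\\歷史小說\\最后的可汗.doc");
FileUtils.write(new File("D:\\test\\doc.html"), htmlStr, "utf-8");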
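
If adding commons-io is not desirable, the JDK's `java.nio.file.Files` can do the same job. The following is a minimal sketch under that assumption; `HtmlFileWriter` is a hypothetical name, and the `read` method exists only so the round trip can be checked.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: write an HTML string to disk as UTF-8 using only the JDK. */
final class HtmlFileWriter {
    /** Creates missing parent directories, then writes (or overwrites) the target file. */
    static Path write(Path target, String html) {
        try {
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the file back as a UTF-8 string. */
    static String read(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Usage would look like `HtmlFileWriter.write(Paths.get("D:\\test\\doc.html"), htmlStr)`.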

Alternatively, the code above can be adjusted to write the HTML file directly, as follows:

1. Converting a Word file to an HTML file

Original Word file:

1.1 Using aspose

public static void wordToHtml(String wordPath, String htmlPath) {
        File sourceFile = new File(wordPath);
        String path = htmlPath + File.separator + sourceFile.getName().substring(0, sourceFile.getName().lastIndexOf(".")) + ".html";
        File file = new File(path); // target HTML file
        // try-with-resources makes sure the output stream is closed
        try (FileOutputStream os = new FileOutputStream(file)) {
            Document doc = new Document(wordPath); // wordPath is the Word document to convert
            HtmlSaveOptions options = new HtmlSaveOptions();
            options.setExportImagesAsBase64(true);   // embed images in the HTML
            options.setExportRelativeFontSize(true); // use relative font sizes
            doc.save(os, options);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Result after conversion to HTML:

1.2 Using poi + pdfbox

public void wordToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
        String ext = wordPath.substring(wordPath.lastIndexOf("."));
        if (ext.equals(".docx")) {
            word2007ToHtml(wordPath, htmlPath);
        } else if (ext.equals(".doc")){
            word2003ToHtml(wordPath, htmlPath);
        } else {
            throw new RuntimeException("Incorrect file format");
        }
    }

    public void word2007ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        try(FileOutputStream out = new FileOutputStream(htmlPath)){
            word2007ToHtmlOutputStream(wordPath, out);
        }
    }

    private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
        // A negative ratio disables POI's zip-bomb check; only do this for trusted input
        ZipSecureFile.setMinInflateRatio(-1.0d);
        // try-with-resources closes the input stream and the document
        try (InputStream in = Files.newInputStream(Paths.get(wordPath));
             XWPFDocument document = new XWPFDocument(in)) {
            XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
            // Write the converted XHTML to the supplied output stream
            XHTMLConverter.getInstance().convert(document, out, options);
        }
    }

    public void word2003ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
        // Serialize the DOM to the target HTML file
        try(OutputStream outStream = Files.newOutputStream(Paths.get(htmlPath))){
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(outStream);
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer serializer = factory.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
        }
    }

    private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
        // try-with-resources closes the input stream and the document
        try (InputStream input = Files.newInputStream(Paths.get(wordPath));
             HWPFDocument wordDocument = new HWPFDocument(input)) {
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());
            wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
                // Skip pictures whose type POI cannot identify
                if (PictureType.UNKNOWN.equals(pictureType)) {
                    return null;
                }
                BufferedImage bufferedImage = ImgUtil.toImage(content);
                String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
                // Embed each picture as a base64 data URI so everything stays in one page
                return "data:;base64," + base64Img;
            });
            // Parse the Word document
            wordToHtmlConverter.processDocument(wordDocument);
            return wordToHtmlConverter.getDocument();
        }
    }
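
The listings in this part call `FileUtil.getNewFileFullPath(sourcePath, targetDir, "html")`, but the article does not show that helper. Judging from how the aspose listing builds its output path (target directory, source file name without its extension, new extension), it probably behaves roughly like the sketch below. This is a guess at the behavior, offered under the hypothetical name `FileUtilSketch`; it is not the author's implementation.

```java
import java.nio.file.Paths;

/** Hypothetical stand-in for the article's FileUtil helper, inferred from its call sites. */
final class FileUtilSketch {
    /**
     * Returns targetDir/<source file name without extension>.<newExt>.
     * Assumes sourcePath names a file, not a bare root directory.
     */
    static String getNewFileFullPath(String sourcePath, String targetDir, String newExt) {
        String name = Paths.get(sourcePath).getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Keep names that have no extension, or that start with a dot, unchanged
        String base = dot > 0 ? name.substring(0, dot) : name;
        return Paths.get(targetDir, base + "." + newExt).toString();
    }
}
```

Only the last extension is replaced, so `report.v2.docx` becomes `report.v2.html`.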

Result after conversion to HTML:

1.3 Using spire

public void wordToHtml(String wordPath, String htmlPath) {
        htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
        Document document = new Document();
        document.loadFromFile(wordPath);
        document.saveToFile(htmlPath, FileFormat.Html);
    }

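Result after conversion to HTML:

Because the free version is used, page and word-count limits apply; choose the paid version if you need full functionality. PS: this time, surprisingly, the first 50 pages of a 76-page document converted successfully.

2. Converting a PDF file to an HTML file

Original image-based PDF:

Original text-based PDF:

2.1 Using aspose

public static void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        // Note: as published, this listing uses PDFBox's PDDocument with PDFDomTree
        // (from the pdf2dom library), the same as section 2.2, not the Aspose.PDF API.
        File file = new File(pdfPath);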
        String path = htmlPath + File.separator + file.getName().substring(0, file.getName().lastIndexOf(".")) + ".html";
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new PrintWriter(path, "UTF-8");
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
    }

Result for the image-based PDF:

Result for the text-based PDF:

2.2 Using poi + pdfbox

public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        String path = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new PrintWriter(path, "UTF-8");
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
    }

Result for the image-based PDF:

Result for the text-based PDF:

2.3 Using spire

public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        htmlPath = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile(pdfPath);
        pdf.saveToFile(htmlPath, com.spire.pdf.FileFormat.HTML);
    }

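Result for the image-based PDF:
Because the free version is used, only the first three pages come out correctly. Choose the paid version if you need more than three pages.

Result for the text-based PDF:

The conversion failed with an error: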

java.lang.NullPointerException
	at com.spire.pdf.PdfPageWidget.spr┢?(Unknown Source)
	at com.spire.pdf.PdfPageWidget.getSize(Unknown Source)
	at com.spire.pdf.PdfPageBase.spr???—(Unknown Source)
	at com.spire.pdf.PdfPageBase.getActualSize(Unknown Source)
	at com.spire.pdf.PdfPageBase.getSection(Unknown Source)
	at com.spire.pdf.general.PdfDestination.spr︻┎?—(Unknown Source)
	at com.spire.pdf.general.PdfDestination.spr┻┑?—(Unknown Source)
	at com.spire.pdf.general.PdfDestination.getElement(Unknown Source)
	at com.spire.pdf.primitives.PdfDictionary.setProperty(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmark.setDestination(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmarkWidget.spr┭┘?—(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmarkWidget.getDestination(Unknown Source)
	at com.spire.pdf.PdfDocumentBase.spr??(Unknown Source)
	at com.spire.pdf.widget.PdfPageCollection.spr┦?(Unknown Source)
	at com.spire.pdf.widget.PdfPageCollection.removeAt(Unknown Source)
	at com.spire.pdf.PdfDocumentBase.spr┞?(Unknown Source)
	at com.spire.pdf.PdfDocument.loadFromFile(Unknown Source)

3. Converting an Excel file to an HTML file

Original Excel file:

3.1 Using aspose

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        Workbook workbook = new Workbook(excelPath);
        com.aspose.cells.HtmlSaveOptions options = new com.aspose.cells.HtmlSaveOptions();
        workbook.save(htmlPath, options);
    }

Result after conversion to HTML:

3.2 Using poi

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        String path = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        try(FileOutputStream fileOutputStream = new FileOutputStream(path)){
            String htmlStr = excelToHtmlStr(excelPath);
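            // Encode explicitly as UTF-8 instead of relying on the platform default charset
            byte[] bytes = htmlStr.getBytes(java.nio.charset.StandardCharsets.UTF_8);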
            fileOutputStream.write(bytes);
        }
    }


    public String excelToHtmlStr(String excelPath) throws Exception {
        // WorkbookFactory opens the file itself, so no separate FileInputStream is needed
        try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
            DataFormatter dataFormatter = new DataFormatter();
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
            StringBuilder htmlStringBuilder = new StringBuilder();
            htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
            htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
            htmlStringBuilder.append("</head><body><table>");
            for (Row row : sheet) {
                htmlStringBuilder.append("<tr>");
                for (Cell cell : row) {
                    CellType cellType = cell.getCellType();
                    if (cellType == CellType.FORMULA) {
                        formulaEvaluator.evaluateFormulaCell(cell);
                        cellType = cell.getCachedFormulaResultType();
                    }
                    String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                }
                htmlStringBuilder.append("</tr>");
            }
            htmlStringBuilder.append("</table></body></html>");
            return htmlStringBuilder.toString();
        }
    }

Result after conversion to HTML:

3.3 Using spire

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        Workbook workbook = new Workbook();
        workbook.loadFromFile(excelPath);
        workbook.saveToFile(htmlPath, com.spire.xls.FileFormat.HTML);
    }

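Result after conversion to HTML:

IV. Summary

The results above show that conversion to HTML is far from ideal: many style details are not reproduced. This is because converters of this kind usually aim to produce simple, clean HTML from the semantic information in the document and ignore everything else, so complex styling such as centering, first-line indents, fonts, text size, and color is dropped along the way. For example, any paragraph with the Heading 1 style is converted to an h1 element rather than an attempt to reproduce the heading's exact appearance. The rendered HTML therefore often looks different from the original, and for more complex documents the conversion is unlikely to be perfect. If your documents use only simple styles, or you do not care much about styling, this approach is still worth a try.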
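
The semantic mapping described here can be pictured with a toy example. The sketch below is purely illustrative: `SemanticMapper`, its style table, and the fallback to `p` are assumptions made for the example and do not reproduce how any of the libraries above work internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy illustration: map a paragraph's style name to an HTML tag and ignore visual styling. */
final class SemanticMapper {
    private static final Map<String, String> STYLE_TO_TAG = new LinkedHashMap<>();
    static {
        STYLE_TO_TAG.put("Heading 1", "h1");
        STYLE_TO_TAG.put("Heading 2", "h2");
        STYLE_TO_TAG.put("Normal", "p");
    }

    /** Wraps the text in the tag for its style; unknown styles fall back to a plain paragraph. */
    static String toHtml(String styleName, String text) {
        String tag = STYLE_TO_TAG.getOrDefault(styleName, "p");
        // The text is assumed to be HTML-escaped already
        return "<" + tag + ">" + text + "</" + tag + ">";
    }
}
```

A paragraph styled "Heading 1" comes out as `<h1>…</h1>` whatever its font or color, and a custom style the table does not know simply becomes `<p>…</p>`, which is why the output looks plainer than the source document.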

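PS: for better display fidelity, the approach from the previous article, "Online Document Preview (1): Implementing Online Preview by Converting txt, Word, and PDF to Images", can be combined with this one: render the document content as images (most likely several), put all of them into a single HTML page, use HTML + CSS to lay them out, and return that HTML. The open-source component kkFileView works this way.

kkfileview display result:

The HTML returned by kkfileview shows that it converts each page of the file (txt files excepted) into an image, embeds those images in a single HTML page, and returns that page to the user.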
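
The image-per-page idea can be sketched in a few lines. The code below is not kkFileView's actual implementation: `ImagePageBuilder`, the inline CSS, and the `alt` text are assumptions for illustration, and the title and image URLs are assumed to be trusted, already-escaped values.

```java
import java.util.List;

/** Hypothetical sketch: wrap per-page images of a document in one HTML page. */
final class ImagePageBuilder {
    /** Builds a page that stacks the images vertically in the given order. */
    static String build(String title, List<String> imageUrls) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html><html><head><meta charset=\"utf-8\"><title>")
            .append(title).append("</title>")
            .append("<style>img{display:block;width:100%;margin:0 auto 8px;}</style>")
            .append("</head><body>");
        int page = 1;
        for (String url : imageUrls) {
            html.append("<img src=\"").append(url)
                .append("\" alt=\"page ").append(page++).append("\">");
        }
        return html.append("</body></html>").toString();
    }
}
```

The URLs could point at page images produced by the image-conversion approach from the previous article, or be base64 data URIs if a single self-contained file is preferred.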
