Hands-On | Reading Word Documents in Java, Including Tables

小言_互联网's blog

2021-03-17 08:39

I can't post feel-good pieces every day, so today I'm sharing a hands-on development write-up.

Business requirement

Our requirement: extract the content of a Word document, assemble it into a specific JSON format, and send it to a third-party engine's API. The input protocol looks like this:

    {
        "tables": [
            {
                "cells": [
                    {
                        "col": 1,
                        "row_span": 1,
                        "row": 1,
                        "col_span": 1,
                        "content": "车辆名称"
                    }
                ],
                "id": 0,
                "row_num": 2
            }
        ],
        "paragraps": [
            {
                "para_id": 1,
                "content": "Hello,JAVA日知录"
            }
        ]
    }
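Before any document parsing is written, the protocol above can be modeled with a few small Java helpers. The sketch below is only one possible shape and is not the engine's official client: the names `ProtocolSketch`, `Cell`, `quote`, `tableJson`, `paragraphJson`, and `requestJson` are hypothetical, the sample strings are placeholders, and the JSON is built by hand only to keep the example free of dependencies. In a real project a JSON library would normally handle serialization. The key `paragraps` is spelled exactly as in the protocol.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds the request body by hand, using only the JDK.
class ProtocolSketch {

    // One table cell; the emitted keys mirror the sample protocol.
    static final class Cell {
        final int row, col, rowSpan, colSpan;
        final String content;

        Cell(int row, int col, int rowSpan, int colSpan, String content) {
            this.row = row;
            this.col = col;
            this.rowSpan = rowSpan;
            this.colSpan = colSpan;
            this.content = content;
        }

        String toJson() {
            return "{\"col\": " + col + ", \"row_span\": " + rowSpan
                    + ", \"row\": " + row + ", \"col_span\": " + colSpan
                    + ", \"content\": " + quote(content) + "}";
        }
    }

    // Wrap a string in quotes and escape the characters JSON requires.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    static String tableJson(int id, int rowNum, List<Cell> cells) {
        List<String> parts = new ArrayList<>();
        for (Cell c : cells) {
            parts.add(c.toJson());
        }
        return "{\"cells\": [" + String.join(", ", parts) + "], \"id\": " + id
                + ", \"row_num\": " + rowNum + "}";
    }

    static String paragraphJson(int paraId, String content) {
        return "{\"para_id\": " + paraId + ", \"content\": " + quote(content) + "}";
    }

    // Takes already-serialized tables and paragraphs and wraps them in the envelope.
    static String requestJson(List<String> tables, List<String> paragraphs) {
        return "{\"tables\": [" + String.join(", ", tables)
                + "], \"paragraps\": [" + String.join(", ", paragraphs) + "]}";
    }

    public static void main(String[] args) {
        Cell cell = new Cell(1, 1, 1, 1, "Vehicle name");
        String table = tableJson(0, 2, Collections.singletonList(cell));
        String para = paragraphJson(1, "Hello, Word");
        System.out.println(requestJson(
                Collections.singletonList(table), Collections.singletonList(para)));
    }
}
```

Keeping the serialization separate from the extraction code means the Word-reading library can be swapped without touching the request format.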

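The format makes it clear that we need to read the Word content separately as paragraphs and as tables. With the requirement settled, let's start writing code.

Implementing with POI

Search Baidu for "how to read Word in Java" and nearly every answer uses POI. POI can indeed extract content paragraph by paragraph and table by table and assemble it into the format above, but in practice we ran into two problems: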

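1. The two formats, docx and doc, must be handled separately. POI uses different APIs to read docx and doc, so the reading logic has to be written twice.
2. When POI reads the paragraphs of a doc file, it returns the table content as well. This one is a real trap: POI has a dedicated method for reading all tables in a document, yet reading the paragraphs of a doc-format file also returns the text inside tables, so we have to exclude it like this:

    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.usermodel.Paragraph;
    import org.apache.poi.hwpf.usermodel.Range;

    // Read the .doc file (stream is an InputStream opened on the document)
    HWPFDocument doc = new HWPFDocument(stream);
    Range range = doc.getRange();

    // Read the paragraphs
    int num = range.numParagraphs();
    Paragraph para;
    for (int i = 0; i < num; i++) {
        para = range.getParagraph(i);
        // Skip paragraphs that belong to a table
        if (!para.isInTable()) {
            System.out.println(para.text());
        }
    }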
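The first problem comes down to picking a parser per format. A file extension can be wrong, so one option is to look at the leading bytes instead: binary .doc files are OLE2 compound files that begin with `D0 CF 11 E0 A1 B1 1A E1`, while .docx files are ZIP archives that begin with `PK 03 04`. The sketch below uses only the JDK; the names `WordFormatSniffer`, `Format`, and `detect` are hypothetical. These signatures identify the container only (other Office binary formats are also OLE2, and any ZIP file matches the second check), so the result is a hint for dispatching, not proof that the file is a valid Word document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: guesses the Word container format from the file header.
class WordFormatSniffer {

    enum Format { DOC, DOCX, UNKNOWN }

    // OLE2 compound file signature (legacy binary .doc).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature (.docx is a ZIP package).
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    static Format detect(byte[] header) {
        if (startsWith(header, OLE2)) {
            return Format.DOC;
        }
        if (startsWith(header, ZIP)) {
            return Format.DOCX;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads up to 8 bytes from the file and classifies them.
    static Format detect(Path file) throws IOException {
        byte[] header = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < header.length) {
                int r = in.read(header, n, header.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        }
        return detect(Arrays.copyOf(header, n));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(detect(Paths.get(args[0])));
        }
    }
}
```

With POI, `Format.DOC` would route to the `HWPFDocument` path and `Format.DOCX` to the docx API, which is exactly the duplicated logic described in the first problem.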

For these two reasons we ended up not using POI for the Word extraction feature and went with a second approach instead: Spire.Doc for Java.

Spire.Doc for Java

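Spire.Doc for Java is a professional Java Word component that lets developers integrate creating, reading, editing, converting, and printing Word documents into their own Java applications.
As a fully standalone component, Spire.Doc for Java does not require Microsoft Office to be installed. The official site is https://www.e-iceblue.cn/, and our project uses the free edition.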

First, add the e-iceblue Maven repository:

    <repositories>
        <repository>
            <id>com.e-iceblue</id>
            <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>

Then add the corresponding dependency:

    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc.free</artifactId>
        <version>3.9.0</version>
    </dependency>

Now read the Word document. What follows is a test class:

    import com.spire.doc.*;
    import com.spire.doc.documents.*;

    public class SpireApplication {

        public static void main(String[] args) {
            String path = "D:\\testDoc22.doc";
            spireParaghDoc(path);
            spireForTableOfDoc(path);
        }

        // Read the paragraphs of every section
        public static void spireParaghDoc(String path) {
            Document doc = new Document(path);
            for (int i = 0; i < doc.getSections().getCount(); i++) {
                Section section = doc.getSections().get(i);
                for (int j = 0; j < section.getParagraphs().getCount(); j++) {
                    Paragraph paragraph = section.getParagraphs().get(j);
                    System.out.println(paragraph.getText());
                }
            }
        }

        // Read the tables of every section
        public static void spireForTableOfDoc(String path) {
            Document doc = new Document(path);
            for (int i = 0; i < doc.getSections().getCount(); i++) {
                Section section = doc.getSections().get(i);
                for (int j = 0; j < section.getBody().getChildObjects().getCount(); j++) {
                    DocumentObject obj = section.getBody().getChildObjects().get(j);
                    // Only table objects are of interest here
                    if (obj.getDocumentObjectType() == DocumentObjectType.Table) {
                        Table table = (Table) obj;
                        for (int k = 0; k < table.getRows().getCount(); k++) {
                            TableRow row = table.getRows().get(k);
                            for (int c = 0; c < row.getCells().getCount(); c++) {
                                TableCell cell = row.getCells().get(c);
                                // A cell can hold several paragraphs
                                for (int h = 0; h < cell.getParagraphs().getCount(); h++) {
                                    Paragraph f = cell.getParagraphs().get(h);
                                    System.out.println(f.getText());
                                }
                            }
                        }
                    }
                }
            }
        }
    }

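With the code above we can read the content of a Word document by paragraph and by table, and then wrap it in whatever format the business system requires.

That's all. I hope you find it helpful!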
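As a rough illustration of that wrapping step, the sketch below turns the text of one table, already collected row by row, into the `cells` / `id` / `row_num` structure of the protocol. It uses only the JDK and makes several assumptions that the source does not confirm: `row` and `col` are treated as 1-based (the sample shows 1 for the first cell), `row_span` and `col_span` are fixed at 1 because merged-cell detection is not covered here, and `row_num` is taken to be the number of rows. The names `TableAssembler` and `toTable` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps a grid of cell texts onto the protocol's table shape.
class TableAssembler {

    static Map<String, Object> toTable(int id, List<List<String>> rows) {
        List<Map<String, Object>> cells = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            List<String> row = rows.get(r);
            for (int c = 0; c < row.size(); c++) {
                // LinkedHashMap keeps the key order used in the sample protocol.
                Map<String, Object> cell = new LinkedHashMap<>();
                cell.put("col", c + 1);      // assumption: 1-based column index
                cell.put("row_span", 1);     // assumption: no merged cells
                cell.put("row", r + 1);      // assumption: 1-based row index
                cell.put("col_span", 1);     // assumption: no merged cells
                cell.put("content", row.get(c));
                cells.add(cell);
            }
        }
        Map<String, Object> table = new LinkedHashMap<>();
        table.put("cells", cells);
        table.put("id", id);
        table.put("row_num", rows.size()); // assumption: row_num means row count
        return table;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Vehicle name", "Plate"),
                Arrays.asList("Truck", "A-001"));
        System.out.println(toTable(0, rows));
    }
}
```

In the test class, the row lists would be filled where the cell text is currently printed, for example by joining the paragraph texts of each cell into one string. The resulting map can then be handed to any JSON library for serialization.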
