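How do you read Excel, docx, pdf, and txt files in Java?

1. A quick example

Below is the code I use in my own development work to read "doc", "docx", "pdf", and "txt" files; each format is explained in detail later.

Reading a txt file needs little explanation: just read it from the stream.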
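As a minimal sketch of the stream-based approach, the helper below drains an InputStream and decodes the bytes as UTF-8. The class name TxtStreamReader and the choice of UTF-8 are assumptions made for illustration; a file saved in another encoding would need a different Charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original project code.
final class TxtStreamReader {
    // Reads the whole stream into memory and decodes it as UTF-8 (assumed encoding).
    static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 once the end of the stream is reached.
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The caller stays responsible for closing the stream, for example with try-with-resources around a FileInputStream.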

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// fileExtension: the file's extension, e.g. "txt"
// BusinessException, ResultCodeEnum and log come from the surrounding project and are not shown here.
private String readFileContent(MultipartFile file, String fileExtension) throws IOException {
    byte[] fileBytes = file.getBytes();
    if (fileBytes.length == 0) {
        throw new BusinessException(ResultCodeEnum.FILE_CONTENT_IS_EMPTY);
    }
    switch (fileExtension) {
        case "txt":
            return new String(fileBytes, StandardCharsets.UTF_8);
        case "pdf":
            try (PDDocument doc = PDDocument.load(file.getInputStream())) {
                PDFTextStripper textStripper = new PDFTextStripper();
                return textStripper.getText(doc);
            }
        case "docx":
            // Closing the extractor also releases the underlying document
            try (InputStream stream = file.getInputStream();
                 XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(stream))) {
                return extractor.getText();
            }
        case "doc":
            try (InputStream stream = file.getInputStream();
                 WordExtractor extractor = new WordExtractor(stream)) {
                return extractor.getText();
            }
        default:
            log.error("Unsupported file format: {}", fileExtension);
            return null;
    }
}
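The method above expects fileExtension to match the lower-case case labels exactly, but the original does not show how that value is derived. One possible sketch, assuming the extension is simply the text after the last dot of the file name (FileExtensions and extensionOf are hypothetical names):

```java
import java.util.Locale;

// Hypothetical helper, not part of the original project code.
final class FileExtensions {
    // Returns the lower-cased text after the last dot, or "" when there is none.
    static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        int dot = fileName.lastIndexOf('.');
        // No dot, or a trailing dot: treat both as "no extension".
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

With a Spring MultipartFile this could be called as readFileContent(file, FileExtensions.extensionOf(file.getOriginalFilename())), so that "Report.PDF" and "report.pdf" reach the same branch.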

2. Add the dependencies

<dependencies>
  <!-- Apache POI: read and write Microsoft Office documents -->
  <dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>5.0.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.0.0</version>
  </dependency>
  <!-- Apache POI scratchpad: contains the HWPF classes (WordExtractor, HWPFDocument) used for .doc files -->
  <dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>5.0.0</version>
  </dependency>

  <!-- Apache PDFBox: work with PDF files -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.24</version>
  </dependency>

  <!-- Apache Tika: detect file types and extract metadata and text -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.1.0</version>
  </dependency>

  <!-- iText: work with PDF files -->
  <dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.13</version>
  </dependency>
</dependencies>

1. Reading pdf

PDF files can be read with the Apache PDFBox library. The following example extracts the text content of a PDF file:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PdfReaderExample {
    public static void main(String[] args) {
        File file = new File("path_to_your_pdf_file.pdf");
        // 1. Load the PDF document; try-with-resources closes it even if extraction fails
        try (PDDocument document = PDDocument.load(file)) {
            // 2. Create a PDFTextStripper and extract the text
            PDFTextStripper textStripper = new PDFTextStripper();
            String content = textStripper.getText(document);

            // 3. Print the text
            System.out.println(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Be sure to replace path_to_your_pdf_file.pdf with the actual path of your PDF file. PDDocument.load() loads the file, a PDFTextStripper is created, and its getText() method extracts the text. When the try block ends, the try-with-resources statement calls document.close() automatically, so the document is released even if an exception is thrown.

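PDDocument.load() accepts several kinds of argument. The notes below refer to PDFBox 2.0.x, the version declared in the dependencies:

File object: pass a java.io.File that points to the PDF file. Example: PDDocument.load(new File("path_to_your_pdf_file.pdf")).
File path string: as far as I can tell, 2.0.x has no load(String) overload (one existed in the older 1.8.x line), so wrap the path in a File instead: PDDocument.load(new File(path)).
InputStream object: pass a java.io.InputStream from which the PDF content is read. Example: PDDocument.load(inputStream).
RandomAccessRead object: to my knowledge 2.0.x does not expose a public load(RandomAccessRead); loading from an org.apache.pdfbox.io.RandomAccessRead is offered in PDFBox 3.x through Loader.loadPDF(randomAccessRead).

Choose whichever form fits your situation. In every case, handle the IOException that loading may throw, and call close() on the PDDocument when you are done with it so that its resources are released.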
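A minimal sketch of two of these loading styles, assuming PDFBox 2.0.x is on the classpath. PdfTextReader and readText are hypothetical names; load(InputStream) and load(byte[]) are the 2.0.x overloads being illustrated.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, written against the PDFBox 2.0.x API.
final class PdfTextReader {
    // Extracts all text from a PDF supplied as a stream.
    static String readText(InputStream in) throws IOException {
        // try-with-resources guarantees close() runs, as required above.
        try (PDDocument document = PDDocument.load(in)) {
            return new PDFTextStripper().getText(document);
        }
    }

    // Same extraction for a PDF that is already fully in memory.
    static String readText(byte[] pdfBytes) throws IOException {
        try (PDDocument document = PDDocument.load(pdfBytes)) {
            return new PDFTextStripper().getText(document);
        }
    }
}
```

The byte-array variant is convenient when the upload has already been read once, for instance to check that it is not empty.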

2. Reading docx

DOCX files can be read with the Apache POI library.

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DocxReaderExample {
    public static void main(String[] args) {
        File file = new File("path_to_your_docx_file.docx");
        // 1. Load the DOCX document; the stream and the document are closed automatically
        try (InputStream fis = new FileInputStream(file);
             XWPFDocument document = new XWPFDocument(fis)) {
            // 2. Extract the text paragraph by paragraph
            StringBuilder content = new StringBuilder();
            for (XWPFParagraph paragraph : document.getParagraphs()) {
                content.append(paragraph.getText());
                content.append("\n");
            }

            // 3. Print the text
            System.out.println(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

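The DOCX file is loaded by creating a FileInputStream and passing it to the XWPFDocument constructor. The code then iterates over the paragraphs of the document, calls getText() on each one, and collects the results in a StringBuilder. Finally it prints the text.

There is also another way to extract the text.

// requires: import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
XWPFDocument document = new XWPFDocument(fis);
// 2. Extract the text
XWPFWordExtractor extractor = new XWPFWordExtractor(document);
String text = extractor.getText();

XWPFWordExtractor is a class in the Apache POI library that extracts text from an XWPFDocument object.

Calling getText() on the extractor returns a single string containing the plain text of the whole document. Unlike the paragraph loop above, which only visits the top-level body paragraphs returned by getParagraphs(), the extractor also includes text that sits inside tables.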

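3. Reading doc

Legacy DOC (.doc) files can be read with the HWPF module of Apache POI. The HWPF classes ship in the separate poi-scratchpad artifact, so that artifact must be on the classpath alongside poi.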

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class DocTextExtractor {
    // Returns the text of the .doc file, or null if it could not be read
    public static String extractTextFromDoc(String filePath) {
        File file = new File(filePath);
        // 1. Load the DOC document; the stream, document and extractor are closed automatically
        try (FileInputStream fis = new FileInputStream(file);
             HWPFDocument document = new HWPFDocument(fis);
             WordExtractor extractor = new WordExtractor(document)) {
            // 2. Extract and return the text
            return extractor.getText();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    public static void main(String[] args) {
        String filePath = "path_to_your_doc_file.doc";
        String extractedText = extractTextFromDoc(filePath);
        System.out.println(extractedText);
    }
}

4. Reading Excel

1. Reading an Excel file with Apache POI

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExcelReader {

    public static void main(String[] args) throws IOException {
        File file = new File("path/to/excel/file");
        // The stream and the workbook are closed automatically, even if reading fails
        try (FileInputStream inputStream = new FileInputStream(file);
             XSSFWorkbook workbook = new XSSFWorkbook(inputStream)) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                for (Cell cell : row) {
                    System.out.print(cell.toString() + "\t");
                }
                System.out.println();
            }
        }
    }
}

The code first creates a File object for the Excel file and a FileInputStream to read it. It then uses the XSSFWorkbook class to create a workbook object that represents the whole Excel document, and fetches the first sheet (the one at index 0). XSSFWorkbook handles the .xlsx format; for legacy .xls files POI provides HSSFWorkbook instead.

The outer loop iterates over every row (Row) and the inner loop over every cell (Cell) in that row. cell.toString() returns the cell's value as a string, which is printed. When the try block ends, the workbook and the stream are closed automatically and their resources are released.
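One thing to watch with cell.toString(): a numeric cell that holds the integer 1 typically prints as 1.0. A possible alternative, sketched here under the assumption that POI 5.0.0 is on the classpath, is POI's DataFormatter, which renders each cell the way the spreadsheet would display it. SheetTextReader and readFirstSheet are hypothetical names.

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original article's code.
final class SheetTextReader {
    // Reads the first sheet and returns each defined cell as formatted text.
    // Formula cells come back as their formula text because no evaluator is passed.
    static List<List<String>> readFirstSheet(InputStream in) throws IOException {
        DataFormatter formatter = new DataFormatter();
        List<List<String>> rows = new ArrayList<>();
        // WorkbookFactory inspects the content, so .xls and .xlsx both work
        // (given poi and poi-ooxml on the classpath).
        try (Workbook workbook = WorkbookFactory.create(in)) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                List<String> values = new ArrayList<>();
                for (Cell cell : row) {
                    values.add(formatter.formatCellValue(cell));
                }
                rows.add(values);
            }
        }
        return rows;
    }
}
```

Returning the rows instead of printing them keeps the helper easy to reuse; the caller can still print each row joined with tabs.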

2. Using EasyExcel

EasyExcel is an open-source Java tool for working with Excel. It offers a simple API for reading, writing, and manipulating Excel files.

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>2.4.3</version>
</dependency>

Reading an Excel file

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.read.builder.ExcelReaderBuilder;
import com.alibaba.excel.read.listener.ReadListener;

public class ExcelReader {
    public static void main(String[] args) {
        String filePath = "path_to_your_excel_file.xlsx";

        // Create the Excel reader builder
        ExcelReaderBuilder readerBuilder = EasyExcel.read(filePath);

        // Register the read listener (YourReadListener is a class you write yourself)
        ReadListener<Object> listener = new YourReadListener();
        readerBuilder.registerReadListener(listener);

        // Run the read
        readerBuilder.sheet().doRead();
    }
}

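EasyExcel.read(filePath) creates an Excel reader builder, and registerReadListener() registers a read listener on it. You have to supply the listener yourself: write a class that implements the ReadListener interface and override the relevant methods to process the rows as they are read. Finally, sheet().doRead() performs the read.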
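The YourReadListener class is left undefined above. A minimal sketch of what it might look like, assuming EasyExcel's AnalysisEventListener base class (which implements ReadListener) and simply collecting every row in memory; the getRows() accessor is a hypothetical addition:

```java
import com.alibaba.excel.context.AnalysisContext;
import com.alibaba.excel.event.AnalysisEventListener;

import java.util.ArrayList;
import java.util.List;

// Hypothetical listener matching the YourReadListener placeholder used above.
class YourReadListener extends AnalysisEventListener<Object> {
    private final List<Object> rows = new ArrayList<>();

    // Called once for every data row that EasyExcel parses.
    @Override
    public void invoke(Object row, AnalysisContext context) {
        rows.add(row);
    }

    // Called once after the last row of the sheet has been delivered.
    @Override
    public void doAfterAllAnalysed(AnalysisContext context) {
        System.out.println("Rows read: " + rows.size());
    }

    // Hypothetical accessor so the caller can use the collected rows.
    List<Object> getRows() {
        return rows;
    }
}
```

Since no head class is passed to EasyExcel.read(filePath), each row is expected to arrive as a map from column index to cell text; it is worth confirming that against the EasyExcel version you use. For large files, processing rows inside invoke() in batches avoids holding the whole sheet in memory.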

Original article by guozi. If you repost it, please credit the source: https://www.sudun.com/ask/90187.html
