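Addendum: the code below is Maven-based; the dependency JARs have now also been exported separately.
Address:
It is only two files.
Reading the plain text of a PDF in Java, using the pdfbox toolkit
Add the following configuration to Maven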
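<dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>1.7</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.12</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.12</version>
</dependency>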
Reading directly with the utility class
Code example
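import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.junit.Test;

/* Read the text of a PDF */
@Test
public void readPdfTextTest() throws IOException {
    byte[] bytes = getBytes("D:\\code\\pdf\\HashMap.pdf");
    // Load the PDF document; try-with-resources closes it when done
    try (PDDocument document = PDDocument.load(bytes)) {
        readText(document);
    }
}

public void readText(PDDocument document) throws IOException {
    // PDFTextStripper extracts the text of all pages as one string
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println(text);
}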
Converting a PDF to HTML
Result
Code example
import java.io.*;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;
import org.junit.Test;

/* Convert a PDF to HTML */
@Test
public void pdfToHtmlTest() {
    String outputPath = "D:\\code\\pdf\\HashMap.html";
    byte[] bytes = getBytes("D:\\code\\pdf\\HashMap.pdf");
    // Resources declared inside try (...) are closed automatically
    try (BufferedWriter out = new BufferedWriter(
             new OutputStreamWriter(new FileOutputStream(new File(outputPath)), "UTF-8"));
         PDDocument document = PDDocument.load(bytes)) {
        PDFDomTree pdfDomTree = new PDFDomTree();
        pdfDomTree.writeText(document, out);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

/* Read a file into a byte array */
private byte[] getBytes(String filePath) {
    byte[] buffer = null;
    try (FileInputStream fis = new FileInputStream(new File(filePath));
         ByteArrayOutputStream bos = new ByteArrayOutputStream(1000)) {
        byte[] b = new byte[1000];
        int n;
        while ((n = fis.read(b)) != -1) {
            bos.write(b, 0, n);
        }
        buffer = bos.toByteArray();
    } catch (IOException e) {
        // Also covers FileNotFoundException; buffer stays null on failure
        e.printStackTrace();
    }
    return buffer;
}
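The hand-written getBytes helper can also be replaced by a single standard-library call. A minimal sketch, assuming Java 8 or later; the class name FileBytes and the method readAll are made up for illustration and are not part of the original project. Unlike the helper in the post, it throws instead of returning null when the file cannot be read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original post.
final class FileBytes {
    private FileBytes() {}

    // Reads the whole file into memory; fails loudly instead of returning null.
    static byte[] readAll(String filePath) {
        try {
            return Files.readAllBytes(Paths.get(filePath));
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read " + filePath, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny temporary file and read it back.
        Path tmp = Files.createTempFile("sample", ".pdf");
        Files.write(tmp, new byte[] {'%', 'P', 'D', 'F'});
        System.out.println(readAll(tmp.toString()).length);
        Files.delete(tmp);
    }
}
```

The resulting array can be passed to PDDocument.load(bytes) exactly as before. For large files it may be preferable to skip the byte array altogether and load from a File.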
A complete feature: upload a PDF and convert it to HTML (no more hunting for third-party services to convert PDFs, haha)
@RequestMapping("ud")
@Controller
public class UpAndDownController {

    @RequestMapping("upload.do")
    @ResponseBody
    // The Map type arguments were lost in the source; String values are assumed here
    public Map<String, String> upload(@RequestParam("file") MultipartFile file, HttpServletRequest request) {
        Map<String, String> map = new HashMap<>();
        map.put("code", "200");
        try {
            // PdfConvertUtil is the author's own utility class (not shown in this post)
            PdfConvertUtil pdfConvertUtil = new PdfConvertUtil();
            String pdfName = file.getOriginalFilename();
            // Strip the ".pdf" extension; a name without it ends up in the catch block
            int lastIndex = pdfName.lastIndexOf(".pdf");
            String fileName = pdfName.substring(0, lastIndex);
            String htmlName = fileName + ".html";
            String realPath = ResourceUtils.getURL("classpath:").getPath() + "/templates/file";
            File f = new File(realPath);
            if (!f.exists()) {
                f.mkdirs();
            }
            // File.separator instead of a hard-coded "\\" so the path also works outside Windows
            String htmlPath = realPath + File.separator + htmlName;
            pdfConvertUtil.pdftohtml(file.getBytes(), htmlPath);
        } catch (Exception e) {
            map.put("code", "500");
            e.printStackTrace();
        }
        return map;
    }
}
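The controller derives the HTML file name with lastIndexOf(".pdf"), which fails on names such as "REPORT.PDF" or on names without the extension. Below is a rough sketch of a more defensive variant using only the JDK; the class PdfNames and its two methods are hypothetical names chosen for illustration, not something the original project contains.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper, not part of the original post.
final class PdfNames {
    private PdfNames() {}

    // Turns "HashMap.pdf" into "HashMap.html"; rejects anything that is not a PDF name.
    static String htmlNameFor(String originalFilename) {
        if (originalFilename == null) {
            throw new IllegalArgumentException("missing file name");
        }
        // Some clients send a full path; keep only the last segment.
        int cut = Math.max(originalFilename.lastIndexOf('/'), originalFilename.lastIndexOf('\\'));
        String name = originalFilename.substring(cut + 1);
        // Compare the extension case-insensitively and require a non-empty base name.
        if (name.length() <= 4 || !name.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            throw new IllegalArgumentException("not a PDF file name: " + originalFilename);
        }
        return name.substring(0, name.length() - 4) + ".html";
    }

    // Joins directory and file name with the platform's own separator.
    static Path htmlPathFor(String directory, String originalFilename) {
        return Paths.get(directory, htmlNameFor(originalFilename));
    }

    public static void main(String[] args) {
        System.out.println(htmlNameFor("HashMap.pdf"));
        System.out.println(htmlPathFor("templates/file", "C:\\docs\\Report.PDF"));
    }
}
```

In the controller, the result of htmlPathFor(realPath, file.getOriginalFilename()).toString() could stand in for the manually concatenated htmlPath.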
You can debug it with Postman
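The request has to go out as multipart/form-data, because the controller receives a MultipartFile; application/x-www-form-urlencoded cannot carry a file
So under Body choose form-data, add a key named file of type File, and leave the Content-Type header for Postman to generate (it includes the boundary), OK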
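For reference, a form-data body in Postman is sent as multipart/form-data. Here is a rough sketch of what such a request body looks like on the wire, built by hand with the JDK only. The class MultipartSketch, its methods, and the boundary value are illustrative assumptions; a real client (Postman, or an HTTP library) generates the boundary itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of the original post.
final class MultipartSketch {
    private MultipartSketch() {}

    // Value for the Content-Type request header; the boundary must match the body.
    static String contentType(String boundary) {
        return "multipart/form-data; boundary=" + boundary;
    }

    // Builds a body with a single file part, e.g. fieldName "file" for @RequestParam("file").
    static byte[] body(String boundary, String fieldName, String fileName, byte[] content) {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n"
                + "\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        byte[] headBytes = head.getBytes(StandardCharsets.UTF_8);
        byte[] tailBytes = tail.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(content, 0, content.length);
        out.write(tailBytes, 0, tailBytes.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] fakePdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println("Content-Type: " + contentType("demo-boundary"));
        System.out.println(new String(body("demo-boundary", "file", "HashMap.pdf", fakePdf),
                StandardCharsets.ISO_8859_1));
    }
}
```

The part name has to match the @RequestParam("file") in the controller, which is why the key in Postman must also be called file.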
If you need an HTML page to load a PDF directly, with no plugin required,
you can refer to