
Tika Content Extraction Module

Zhang Baofeng edited this page Jun 11, 2013 · 1 revision

Overview

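For the Tika-based PDF full-text extraction, I started by extracting the text of more than a thousand papers and storing it in MongoDB, then built an index on top of it to provide full-text search. I did not develop the Tika part any further, for two reasons. First, full-text search adds little value to academic search, so the feature was of marginal use. Second, when I tried to use Tika to structure paper content, structured extraction did not work for every PDF and the results were not very good. The notes below describe how I used Tika.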

Tika Content Extraction Work

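The dcd.academic.util package contains a dedicated utility class, TikaUtil.java, which provides two ways to extract PDF content: one takes the local path of the PDF, the other takes an InputStream of the PDF (useful, for example, when the PDF is read out of GridFS).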

package dcd.academic.util;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TikaUtil {

	public static String getPdfContentByPath(String path) throws SAXException, TikaException {
		// try-with-resources closes the file stream even when parsing fails
		try (InputStream iStream = new BufferedInputStream(new FileInputStream(new File(path)))) {
			// Reuse the stream-based method so the parsing logic lives in one place
			return getPdfContentByStream(iStream);
		} catch (IOException e) {
			// Reached when the file cannot be opened or closed
			System.out.println("Cannot read pdf: " + path);
			return "";
		}
	}

	public static String getPdfContentByStream(InputStream is) throws SAXException, TikaException {
		Parser parser = new PDFParser();
		try {
			// -1 disables BodyContentHandler's default 100,000-character write limit,
			// which would otherwise abort long papers with a SAXException
			ContentHandler iHandler = new BodyContentHandler(-1);

			Metadata meta = new Metadata();
			meta.add(Metadata.CONTENT_ENCODING, "utf-8");
			// parse() emits startDocument/endDocument itself, so no manual calls are needed
			parser.parse(is, iHandler, meta, new ParseContext());

			return iHandler.toString();
		} catch (IOException e) {
			System.out.println("Not full pdf");
			return "";
		}
	}
}

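In dcd.academic.data, PdfAnalyzer attempts both full-text extraction and structuring of a PDF. It reads Tika's output line by line and looks for the words Abstract, Introduction, Conclusions, and References to decide where each part of the paper begins. The approach is simple, and so is the implementation.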

package dcd.academic.data;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.xml.sax.SAXException;

import dcd.academic.model.PaperModel;

public class PdfAnalyzer {
	public PaperModel extractPaperModel(InputStream stream) throws SAXException, TikaException {
		Metadata meta = new Metadata();
		meta.add(Metadata.CONTENT_ENCODING, "utf-8");

		// Each flag records that the matching section keyword has been seen.
		boolean isAbstr = false;
		boolean isIntro = false;
		boolean isConcl = false;
		boolean isRefer = false;
		// StringBuilder avoids repeated String copying on long papers.
		StringBuilder head = new StringBuilder();
		StringBuilder abstr = new StringBuilder();
		StringBuilder content = new StringBuilder();
		StringBuilder conclu = new StringBuilder();
		StringBuilder refer = new StringBuilder();
		// try-with-resources closes the reader, which in turn closes the stream.
		try (BufferedReader reader = new BufferedReader(new Tika().parse(stream, meta), 1024 * 2)) {
			String lineString;

			while ((lineString = reader.readLine()) != null) {
				if (lineString.contains("Abstract")) {
					isAbstr = true;
				} else if (lineString.contains("Introduction")) {
					isIntro = true;
				} else if (lineString.contains("Conclusion")) { // also matches "Conclusions"
					isConcl = true;
				} else if (lineString.contains("References")) {
					isRefer = true;
				}

				// Append the line to the latest section reached so far.
				if (isRefer) {
					refer.append(' ').append(lineString);
				} else if (isConcl) {
					conclu.append(' ').append(lineString);
				} else if (isIntro) {
					content.append(' ').append(lineString);
				} else if (isAbstr) {
					abstr.append(' ').append(lineString);
				} else {
					head.append(' ').append(lineString);
				}
			}
			PaperModel pm = new PaperModel();
			pm.setHead(head.toString());
			pm.setAbstrct(abstr.toString());
			pm.setContent(content.toString());
			pm.setConclu(conclu.toString());
			pm.setRefers(refer.toString());
			return pm;
		} catch (IOException e) {
			e.printStackTrace();
			System.out.println("Not full pdf");
			return null;
		}
	}
}
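
The keyword idea can also be tried on plain text, without Tika in the loop. The following is a minimal sketch and not code from the project: SectionSplitter, its split method, and the HEADING pattern are hypothetical names. Unlike PdfAnalyzer, which reacts to a keyword anywhere in a line, this variant assumes that a section heading sits on a line of its own (optionally numbered), so a sentence such as "see References [3]" in the body does not switch sections. Papers that write "Abstract: ..." inline would not be recognised by this assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of dcd.academic: splits already-extracted
// text lines into the same five parts that PdfAnalyzer fills.
class SectionSplitter {
    // Part names in document order; the splitter only ever moves forward.
    private static final List<String> ORDER =
            Arrays.asList("head", "abstract", "content", "conclusion", "references");

    // Assumption: a heading line holds only an optional number and one keyword,
    // e.g. "Abstract", "1. Introduction", "5 Conclusions", "References".
    private static final Pattern HEADING = Pattern.compile(
            "^\\s*(?:\\d+\\.?\\s+)?(abstract|introduction|conclusions?|references)[\\s:.]*$",
            Pattern.CASE_INSENSITIVE);

    static Map<String, String> split(List<String> lines) {
        Map<String, StringBuilder> buckets = new LinkedHashMap<>();
        for (String name : ORDER) {
            buckets.put(name, new StringBuilder());
        }
        int current = 0; // index into ORDER; every paper starts in "head"
        for (String line : lines) {
            String text = line.trim();
            if (text.isEmpty()) {
                continue; // blank lines carry no content
            }
            Matcher m = HEADING.matcher(text);
            if (m.matches()) {
                int next = indexFor(m.group(1));
                if (next > current) {
                    current = next;
                    continue; // the heading line itself is not stored
                }
            }
            StringBuilder sb = buckets.get(ORDER.get(current));
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(text);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buckets.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    // Maps a matched keyword to its position in ORDER.
    private static int indexFor(String keyword) {
        String k = keyword.toLowerCase(Locale.ROOT);
        if (k.equals("abstract")) return 1;
        if (k.equals("introduction")) return 2;
        if (k.startsWith("conclusion")) return 3;
        return 4; // "references"
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "A Study of Tika", "Abstract", "We extract text.",
                "1. Introduction", "PDFs are hard; see References [3].",
                "References", "[1] Some paper.");
        split(lines).forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

To try it on real papers, the lines could come from the string returned by TikaUtil.getPdfContentByStream, split on line breaks. Whether headings survive extraction on a line of their own depends on the PDF, so the pattern would likely need tuning per source.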

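For follow-up development, using Tika to structure PDF content is still feasible and worth exploring in more depth. If you only need the full text, the two methods in TikaUtil are enough. The Tika-app.jar package can also be launched directly with java -jar, and it comes with a built-in GUI.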