Tika Content Extraction Module
Zhang Baofeng edited this page Jun 11, 2013 · 1 revision
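I used Tika to extract the full text of PDFs. Early on I extracted more than a thousand papers, stored the text in MongoDB, and later built an index on it to provide full-text search. I did not develop this part any further, for two reasons. First, full-text search adds little to an academic search engine and turned out to be a marginal feature. Second, when I tried to use Tika to structure the content of a paper, the structured extraction did not work for every PDF and the results were not very good. The rest of this page describes what I did with Tika.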
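The dcd.academic.util package contains a dedicated utility class, TikaUtil.java, which provides two methods for extracting PDF content: one takes the local path of a PDF, and the other takes an InputStream of the PDF (useful, for example, when the PDF is read out of GridFS).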
package dcd.academic.util;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TikaUtil {

    /** Extracts the text of the PDF at a local path; returns "" if the file cannot be read. */
    public static String getPdfContentByPath(String path) throws SAXException, TikaException {
        // try-with-resources closes the file even when parsing fails
        try (InputStream stream = new BufferedInputStream(new FileInputStream(new File(path)))) {
            return getPdfContentByStream(stream);
        } catch (IOException e) {
            System.err.println("Could not read PDF (missing or incomplete file): " + e.getMessage());
            return "";
        }
    }

    /** Extracts the text of a PDF from a stream, e.g. one read from GridFS. The caller closes the stream. */
    public static String getPdfContentByStream(InputStream is) throws SAXException, TikaException {
        Parser parser = new PDFParser();
        try {
            // -1 disables BodyContentHandler's default 100,000-character write limit,
            // which a long paper could otherwise exceed
            ContentHandler handler = new BodyContentHandler(-1);
            Metadata meta = new Metadata();
            meta.add(Metadata.CONTENT_ENCODING, "utf-8");
            // parse() emits the start/end document events itself, so the
            // handler must not be started or ended again afterwards
            parser.parse(is, handler, meta, new ParseContext());
            return handler.toString();
        } catch (IOException e) {
            System.err.println("Could not read PDF (possibly incomplete): " + e.getMessage());
            return "";
        }
    }
}

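In dcd.academic.data, PdfAnalyzer attempts both full-text extraction and structuring of a PDF. It reads Tika's output line by line and uses the first occurrence of words such as Abstract, Introduction, Conclusions, and References to delimit the sections of the paper. The method is simple, and so is the implementation.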
package dcd.academic.data;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.xml.sax.SAXException;
import dcd.academic.model.PaperModel;

public class PdfAnalyzer {

    /**
     * Splits a paper into head, abstract, body, conclusion and references by
     * scanning the extracted text line by line for section keywords.
     * Returns null if the stream cannot be read.
     */
    public PaperModel extractPaperModel(InputStream stream) throws SAXException, TikaException {
        Metadata meta = new Metadata();
        meta.add(Metadata.CONTENT_ENCODING, "utf-8");

        // Each flag turns on the first time its keyword appears and stays on.
        boolean isAbstr = false;
        boolean isIntro = false;
        boolean isConcl = false;
        boolean isRefer = false;

        StringBuilder head = new StringBuilder();
        StringBuilder abstr = new StringBuilder();
        StringBuilder content = new StringBuilder();
        StringBuilder conclu = new StringBuilder();
        StringBuilder refer = new StringBuilder();

        // Closing the reader also closes the underlying stream.
        try (BufferedReader reader = new BufferedReader(new Tika().parse(stream, meta), 1024 * 2)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("Abstract")) {
                    isAbstr = true;
                } else if (line.contains("Introduction")) {
                    isIntro = true;
                } else if (line.contains("Conclusion")) { // also matches "Conclusions"
                    isConcl = true;
                } else if (line.contains("References")) {
                    isRefer = true;
                }

                // The latest section reached wins; the keyword line itself is kept.
                if (isRefer) {
                    refer.append(' ').append(line);
                } else if (isConcl) {
                    conclu.append(' ').append(line);
                } else if (isIntro) {
                    content.append(' ').append(line);
                } else if (isAbstr) {
                    abstr.append(' ').append(line);
                } else {
                    head.append(' ').append(line);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
            System.err.println("Could not read PDF (possibly incomplete)");
            return null;
        }

        PaperModel pm = new PaperModel();
        pm.setHead(head.toString());
        pm.setAbstrct(abstr.toString());
        pm.setContent(content.toString());
        pm.setConclu(conclu.toString());
        pm.setRefers(refer.toString());
        return pm;
    }
}

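
The keyword scan in PdfAnalyzer switches section as soon as a keyword appears anywhere in a line, so a body sentence that happens to contain the capitalized word "Conclusion" sends everything after it into the conclusion. One possible refinement, sketched here under assumptions, is to treat a line as a heading only when the whole line is a keyword with optional numbering, and to only ever move forward through the sections. SectionSplitter, Section, headingOf, split, and the regular expression are hypothetical names and choices, not part of the project. The sketch works on plain text lines, so it does not need Tika, and it will miss headings that share a line with text.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the original project. */
class SectionSplitter {

    /** Sections in the order they are expected to appear in a paper. */
    enum Section { HEAD, ABSTRACT, BODY, CONCLUSION, REFERENCES }

    // A heading is a whole line: optional numbering ("1", "2.", "IV."),
    // then one keyword, then an optional trailing ':' or '.'.
    private static final Pattern HEADING = Pattern.compile(
            "^(?:(?:\\d+|[IVX]+)[.)]?\\s+)?"
            + "(abstract|introduction|conclusions?|references)\\s*[:.]?$",
            Pattern.CASE_INSENSITIVE);

    /** Returns the section that a heading line opens, or null for ordinary text. */
    static Section headingOf(String line) {
        Matcher m = HEADING.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        String word = m.group(1).toLowerCase(Locale.ROOT);
        if (word.equals("abstract")) return Section.ABSTRACT;
        if (word.equals("introduction")) return Section.BODY;
        if (word.startsWith("conclusion")) return Section.CONCLUSION;
        return Section.REFERENCES;
    }

    /** Splits extracted text lines into sections; heading and blank lines are dropped. */
    static Map<Section, String> split(List<String> lines) {
        Map<Section, StringBuilder> parts = new EnumMap<>(Section.class);
        for (Section s : Section.values()) {
            parts.put(s, new StringBuilder());
        }
        Section current = Section.HEAD;
        for (String line : lines) {
            String text = line.trim();
            if (text.isEmpty()) {
                continue;
            }
            Section next = headingOf(text);
            if (next != null && next.ordinal() > current.ordinal()) {
                current = next; // only ever move forward through the paper
                continue;
            }
            StringBuilder sb = parts.get(current);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(text);
        }
        Map<Section, String> result = new EnumMap<>(Section.class);
        parts.forEach((section, sb) -> result.put(section, sb.toString()));
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for Tika's extracted text.
        List<String> lines = Arrays.asList(
                "A Study of Things",
                "Abstract",
                "We study things.",
                "1. Introduction",
                "Conclusion is a word that may appear in the body.",
                "5 Conclusions",
                "Things are fine.",
                "References",
                "[1] A. Author. Introduction to Things.");
        split(lines).forEach((section, text) -> System.out.println(section + ": " + text));
    }
}
```

In the sample, the body sentence that starts with "Conclusion" stays in the body because it is not a standalone heading line. The trade-off is that formats which put the heading and the first sentence on one line are not detected, so the pattern would need adjusting for such papers.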
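If you build on this project and want to use Tika to structure PDF content, the approach is feasible and worth exploring further. If you only need Tika to extract the full text, the two methods in TikaUtil are enough. The Tika-app.jar package can also be started directly with java -jar, and it comes with a built-in graphical interface.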