textractor

Extracts the title, body text, images, author, publish time, and similar fields from HTML text. Intended for news-style web pages.

Installation

    go get github.com/gloomyzerg/textractor

Usage

    package main

    import (
    	"fmt"
    	"io"
    	"log"
    	"net/http"

    	"github.com/gloomyzerg/textractor"
    )

    func main() {
    	url := "http://www.xxx.com/xxx"
    	resp, err := http.Get(url)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer resp.Body.Close()

    	source, err := io.ReadAll(resp.Body)
    	if err != nil {
    		log.Fatal(err)
    	}

    	// This is only an example.
    	// textractor.Extract takes an HTML string;
    	// how you obtain that string is up to you.
    	// For a paginated article, for instance, you can fetch every page
    	// yourself, concatenate the HTML, and pass the result in.
    	result, err := textractor.Extract(string(source))
    	if err != nil {
    		log.Fatal(err)
    	}
    	fmt.Printf("%+v\n", result)
    }

Command-line usage

    go get -u github.com/gloomyzerg/textractor/cmd/...

    textractor [url]

Notes

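textractor uses the method described in the paper 《基于文本及符号密度的网页正文提取方法》 (a method for extracting web page body text based on text density and symbol density), which is fairly accurate on typical Chinese news pages. The paper reports an accuracy above 99%, but because of limited sample availability the author has not tested enough pages to verify that figure.
Because web page markup varies so widely, no extraction algorithm can cover every page. If you find a page that is not extracted correctly, please leave its URL in an issue so it can be analyzed case by case. The author will do their best to improve the tool so that it covers more pages.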

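The textractor command-line tool exists only to make testing and debugging easier. It is a simple wget + extract and cannot parse dynamic pages generated by JavaScript. For dynamic pages, choose a suitable rendering or parsing approach yourself.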

Acknowledgements

This project was inspired by github.com/kingname/GeneralNewsExtractor, and its test cases were used as a reference for development and testing.

