
Analysis and Full Annotation of the PKU Tianwang Search Engine TSE [5]: Building the Inverted Index and the Files Involved

Date: 2021-07-01 13:58:41

Sorry to keep everyone waiting. I was busy with exams for a while, and they are finally over. Enough chatter, let's get started!

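TSE packs all the crawled web pages into one large file and then builds the index over the data in that single file as a whole. The process consists of several steps.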


1. The document index (Doc.idx) keeps information about each document. It is a fixed-width ISAM (indexed sequential access method) index, ordered by docID. The information stored in each entry includes a pointer into the repository, a document length, and a document checksum.

Doc.idx (columns as labeled by the author: docID, document length, checksum hash):

    0 0 bc9ce846d7987c4534f53d423380ba70
    1 76760 4f47a3cad91f7d35f4bb6b2a638420e5
    2 141624 d019433008538f65329ae8e39b86026c
    3 142350 5705b8f58110f9ad61b1321c52605795

The URL index (Url.idx) is used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs and is sorted by checksum. To find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file.

Url.idx:

    5c36868a9c5117eadbda747cbdb0725f 0
    3272e136dd90263ee306a835c6c70d77 1
    6b8601bb3bb9ab80f868d549b5c5a5f3 2
    3f9eba99fa788954b5ff7f35a5db6e1f 3

Command:

    ./DocIndex

This produces Doc.idx, Url.idx and DocId2Url.idx (the files of the same names in the Data folder).

DocId2Url.idx:

    0 http://*.*./index.aspx
    1 http://*.*./showcontent1.jsp?NewsID=118
    2 http://*.*./0102.html
    3 http://*.*./0103.html

2. Sort Url.idx and drop duplicate lines:

    sort Url.idx | uniq > Url.idx.sort_uniq

The result is the Url.idx.sort_uniq in the Data folder, sorted by hash value:

    000bfdfd8b2dedd926b58ba00d40986b 1111
    000c7e34b653b5135a2361c6818e48dc 1831
    0019d12f438eec910a06a606f570fde8 366
    0033f7c005ec776f67f496cd8bc4ae0d 2103

3. Segment the documents into terms (finding each document according to its URL):

    ./DocSegment Tianwang.raw.2559638448

Tianwang.raw.2559638448 is the crawled file; every page in it includes its HTTP headers. The command produces Tianwang.raw.2559638448.seg.

Inside the raw file, consecutive documents should be delimited by the version line, the closing </html> tag and a newline. One record of Tianwang.raw.2559638448 looks like this:

    version: 1.0
    url: http://***.105.138.175/Default2.asp?lang=gb
    origin: http://***.105.138.175/
    date: Fri, 23 May 20:01:36 GMT
    ip: 162.105.138.175
    length: 38413
    HTTP/1.1 200 OK
    Server: Microsoft-IIS/5.0
    Date: Fri, 23 May 11:17:49 GMT
    Connection: keep-alive
    Connection: Keep-Alive
    Content-Length: 38088
    Content-Type: text/html; Charset=gb2312
    Expires: Fri, 23 May 11:17:49 GMT
    Set-Cookie: ASPSESSIONIDSSTRDCAB=IMEOMBIAIPDFCKPAEDJFHOIH; path=/
    Cache-control: private
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "/TR/html4/loose.dtd">
    <html>
    <head>
    <title>Apabi数字资源平台</title>
    <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    <META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
    <META NAME="DESCRIPTION" CONTENT="数字图书馆方正数字图书馆电子图书电子书 ebook e书 Apabi 数字资源平台">
    <link rel="stylesheet" type="text/css" href="css/common.css">
    <style type="text/css">
    </style>
    <script LANGUAGE="vbscript">
    ...
    </script>
    <Script Language="javascript">
    ...
    </Script>
    </head>
    <body leftmargin="0" topmargin="0">
    </body>
    </html>

Tianwang.raw.2559638448.seg puts each page on a single line, as follows (note that there are no line breaks inside a page to act as separators):

    1
    ... ... ...
    2
    ... ... ...

The author marks what follows as non-essential for Tiny search.

4. Create the forward index (docid --> termid):

    ./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx

Tianwang.raw.2559638448.seg holds each page on one line of segmented terms, preceded by a line with its docID:

    1
    三星/ s/ 手机/ 论坛/ ,/ 手机/ 铃声/ 下载/ ,/ 手机/ 图片/ 下载/ ,/ 手机/
    2
    ...
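To make step 4 more concrete, here is a minimal Python sketch of what a forward-index pass over the .seg file might look like. It is not the code of CrtForwardIdx; the function name `forward_pairs` is made up for illustration, and it assumes the layout shown in the excerpt: lines alternate between a docID and that page's terms, with terms separated by "/".

```python
from typing import Iterable, Iterator, Tuple


def forward_pairs(seg_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (term, doc_id) pairs from .seg-style lines.

    Assumed layout (taken from the excerpt, not from the TSE source):
    a docID line, then one line with that page's terms separated by "/".
    """
    lines = [line.rstrip("\n") for line in seg_lines]
    # Pair every even-indexed line (docID) with the line that follows it.
    for doc_id, segmented in zip(lines[0::2], lines[1::2]):
        for token in segmented.split("/"):
            term = token.strip()
            if term:  # skip the empty piece after the trailing "/"
                yield term, doc_id.strip()


if __name__ == "__main__":
    sample = ["1\n", "三星/ s/ 手机/ 论坛/ ,/ 手机/ 铃声/ 下载/\n"]
    for term, doc_id in forward_pairs(sample):
        print(term, doc_id)  # one "term docID" line per occurrence
```

Punctuation tokens such as "," are kept as terms here on purpose, since the moon.fidx excerpt also contains punctuation entries. Repeated terms produce repeated pairs, one per occurrence.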


moon.fidx pairs every term segmented out of a document with that document's number (columns: term, docID):

    都会 2391
    使 2391
    那些 2391
    拥有 2391
    它 2391
    的 2391
    人 2391
    的 2391
    视野 2391
    变 2391
    窄 2391
    在 2180
    研究生部 2180
    主页 2180
    培养 2180
    管理 2180
    栏目 2180
    下载 2180
    ） 2180
    、 2180
    关于 2180
    做好 2180
    年 2180
    国家 2180
    公派 2180
    研究生 2180
    项目 2180

5. Set LANG and sort the forward index:

    # set | grep "LANG"
    LANG=en; export LANG;
    sort moon.fidx > moon.fidx.sort

6. Create the inverted index (termid --> docid):

    ./CrtInvertedIdx moon.fidx.sort > sun.iidx

sun.iidx is roughly half the size of its input:

    花工 236
    花海 2103
    花卉 1018 1061 1061 1061 1730 1730 1730 1730 1730 1852 949 949
    花蕾 447 447
    花木 1061
    花呢 1430
    花期 447 447 447 447 447 525
    花钱 174 236
    花色 1730 1730
    花色品种 1660
    花生 450 526
    花式 1428 1430 1430 1430
    花纹 1430 1430
    花序 447 447 447 447 447 450
    花絮 136 137
    花芽 450 450

The remaining programs:

    TSESearch    CGI program for query
    Snapshot     CGI program for page snapshot
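Steps 5 and 6 together amount to "sort the (term, docID) pairs, then merge consecutive pairs that share a term". The Python sketch below illustrates that idea only; it is not the code of CrtInvertedIdx, the name `invert` is hypothetical, and the sample pairs are made up from terms and docIDs that appear in the excerpts.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def invert(sorted_pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, List[str]]]:
    """Collapse sorted (term, doc_id) pairs into (term, [doc_ids]).

    The input must already be sorted by term, because groupby only
    merges neighbouring items with the same key.
    """
    for term, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield term, [doc_id for _, doc_id in group]


if __name__ == "__main__":
    # Hypothetical forward-index pairs (term, docID as text).
    pairs = [("花卉", "949"), ("花工", "236"), ("花卉", "1018"), ("花卉", "1061")]
    # sorted() stands in for the `sort` command of step 5.
    for term, doc_ids in invert(sorted(pairs)):
        print(term, " ".join(doc_ids))
```

Because docIDs are compared as text, "949" sorts after "1018" and "1061", which is consistent with the sun.iidx excerpt where 949 follows 1852. Duplicate docIDs are kept, one per occurrence of the term, again as in the excerpt. Python orders the terms by code point, so the order of terms may differ from what `sort` produces under a given locale.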

author:/jrckkyy
