March 31

Search Index and Document Size

Enterprise Search works great. But some people speculate that their larger documents will not index completely because Search only looks at the first 64k of data. I checked an online resource at Berkeley concerning document size and then ran an experiment to test this concern.

The Berkeley project found that "office documents" averaged 2.5k/page of plain text. For comparison, I opened Word and typed =rand(9). This function generates nine paragraphs, approximately one page, of text (pulled from various Office help files). I copied the text into Notepad and saved the file, which came to 3.18k. Taking 3.2k/page as a conservative average, 64k / 3.2k means that we only index the first 20 pages of any document.

Before I make a prediction about a real document under real conditions, I need to consider the noiseword filter. Search has a list of noisewords for each language, like this one for English-US:

C:\Program Files\Microsoft Office Servers\12.0\Data\Office Server\Applications\<GUID>\Config\noiseenu.txt

So, I created a random Word document again, but this time performed a find-and-replace on each of the English noise words: a, and, is, in, it, of, the, to. The replacement counts were: a(9), and(6), is(0), in(9), it(0), of(15), the(63), to(15). The new file weighs 2.89k (a difference of 0.29k, or ~10%). At 2.89k/page, we can expect to index 22 pages per document instead of 20.

If you have a lot of large documents, you may want to sample them and add their most common low-value words to the noiseword file. Filtering out industry jargon this way raises the ~20-page threshold slightly. Also, make sure every large document contains a table of contents, since Search will index that first. This strategy, along with Best Bets and Keywords, will help improve search relevancy. However, if your organization has a large percentage of similar 25+ page documents, you may need to consider a divide-and-conquer approach to enhance search results.
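The page estimates above can be reproduced with a few lines of Python. This is only a rough sketch of the arithmetic and of the find-and-replace step, not how Search itself filters text. The function names are made up for illustration, the 64k limit is the figure people speculate about rather than a confirmed setting, and the noise-word list holds only the eight words tested in this post.

```python
import math
import re

INDEX_LIMIT_KB = 64.0  # the limit speculated about in this post (assumption)

# Only the eight noise words tested above; the real noiseenu.txt may differ.
NOISE_WORDS = ["a", "and", "is", "in", "it", "of", "the", "to"]


def pages_indexed(kb_per_page, limit_kb=INDEX_LIMIT_KB):
    """Whole pages of plain text that fit inside the index limit."""
    # Round first so floating-point noise (e.g. 19.999999) does not drop a page.
    return math.floor(round(limit_kb / kb_per_page, 6))


def strip_noise_words(text, noise_words=NOISE_WORDS):
    """Remove whole-word, case-insensitive matches of each noise word."""
    pattern = r"\b(?:" + "|".join(map(re.escape, noise_words)) + r")\b"
    stripped = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the runs of spaces left behind by the removed words.
    return re.sub(r"[ \t]{2,}", " ", stripped).strip()


def size_kb(text):
    """Size of the text in kilobytes when encoded as UTF-8."""
    return len(text.encode("utf-8")) / 1024.0
```

With the numbers from the experiment, pages_indexed(3.2) gives 20 and pages_indexed(2.89) gives 22. To repeat the test on your own documents, compare size_kb(text) with size_kb(strip_noise_words(text)).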
In SharePoint, documents can be "assembled" from various parts, for instance by using a Content Query Web Part to display the chapters of an Employee Handbook. If users need to search that entire document offline, you may need to create an Office script to pull the chapters from SharePoint into a single .docx file (although splitting the document back up could prove problematic).
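As a rough illustration of the divide-and-conquer idea, here is a minimal Python sketch that groups the paragraphs of a plain-text document into parts that each stay under a size limit. It is not a SharePoint feature or API. The function name is hypothetical, the 64k default is the speculated limit from this post, and splitting on blank lines is just one simple choice. Real chapters would normally be split on headings.

```python
LIMIT_BYTES = 64 * 1024  # speculated index limit from this post (assumption)


def split_into_parts(text, limit_bytes=LIMIT_BYTES):
    """Group paragraphs into parts whose UTF-8 size stays within limit_bytes.

    Paragraphs are separated by blank lines. A single paragraph larger than
    the limit is kept whole as its own part rather than cut mid-sentence.
    """
    parts, current, current_size = [], [], 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # skip empty paragraphs
        # Count two extra bytes for the blank-line separator (conservative).
        para_size = len(para.encode("utf-8")) + 2
        if current and current_size + para_size > limit_bytes:
            parts.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(para)
        current_size += para_size
    if current:
        parts.append("\n\n".join(current))
    return parts
```

Each returned part could then be saved as its own document so that all of its text falls inside the indexed range.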