


    March 31

    Search Index and Document Size

    Enterprise Search works great.  But some people speculate that their larger documents will not index completely because Search only looks at the first 64k of data.  I checked an online resource at Berkeley concerning document size and then performed an experiment to test this concern.

    The Berkeley project found that "office documents" averaged 2.5k/page of plain text.  For comparison, I opened Word and typed =rand(9).  This function generates nine paragraphs, approximately one page, of sample text (pulled from various Office help files).  I copied the text into Notepad and saved the file--3.18k.

    Taking 3.2k/page as a conservative average, a 64k limit means that we only index the first 20 pages of any document (64k / 3.2k = 20).  Before I make a prediction about a real document under real conditions, I need to consider the noiseword filter.  Search has a list of noisewords for each language, like this one for English-US:  C:\Program Files\Microsoft Office Servers\12.0\Data\Office Server\Applications\<GUID>\Config\noiseenu.txt
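
    The page estimate is simple division, and a minimal sketch of it looks like the following.  The function name is made up for illustration, the 64k cap is the speculated figure from this post rather than a confirmed setting, and Decimal is used only so that 64 / 3.2 comes out as exactly 20 instead of tripping over floating-point rounding.

```python
from decimal import Decimal


def pages_indexed(limit_kb: str, kb_per_page: str) -> int:
    """Return how many whole pages fit inside the size limit.

    Both arguments are decimal strings (e.g. "3.2") so the division is exact.
    """
    return int(Decimal(limit_kb) // Decimal(kb_per_page))


LIMIT_KB = "64"  # assumed cap, taken from the speculation discussed in the post

if __name__ == "__main__":
    print(pages_indexed(LIMIT_KB, "2.5"))  # published 2.5k/page average
    print(pages_indexed(LIMIT_KB, "3.2"))  # the =rand(9) sample, rounded up
```

    Swapping in your own per-page average (measured from a sample of real documents) gives a rough ceiling for your content.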

    So, I created a random Word document again but this time performed a find-and-replace on each of the English noise words:  a, and, is, in, it, of, the, to.  I replaced the following:  a(9), and(6), is(0), in(9), it(0), of(15), the(63), to(15).  The new file weighs 2.89k (a difference of 0.29k, or ~9%).  This means we can expect to index 22 pages per document instead of 20.
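
    The same find-and-replace experiment can be approximated in a script.  This is only a sketch:  it assumes each noise word is replaced with nothing, matches whole words regardless of case, and measures size as UTF-8 bytes, none of which is guaranteed to match what Search does internally.  The function names are hypothetical.

```python
import re

# The eight English noise words tested in the post.
NOISE_WORDS = {"a", "and", "is", "in", "it", "of", "the", "to"}


def strip_noise_words(text, noise_words=NOISE_WORDS):
    """Delete whole-word, case-insensitive matches and collapse leftover spaces."""
    alternatives = "|".join(re.escape(w) for w in sorted(noise_words))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    stripped = pattern.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", stripped).strip()


def size_reduction(text, noise_words=NOISE_WORDS):
    """Return (bytes_before, bytes_after, fraction_saved) for the given text."""
    before = len(text.encode("utf-8"))
    after = len(strip_noise_words(text, noise_words).encode("utf-8"))
    saved = (before - after) / before if before else 0.0
    return before, after, saved


if __name__ == "__main__":
    sample = "The cat is in the hat and it likes to nap"
    print(strip_noise_words(sample))
    print(size_reduction(sample))
```

    Running it against the plain text of a real document gives a per-document estimate of how much the filter buys you.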

    If you have a lot of large documents, you may want to sample them and add frequently recurring terms to the noiseword file.  That way Search can ignore industry jargon, and the ~20-page threshold increases slightly.  Also, make sure every large document contains a table of contents--Search will index that first.  This strategy, along with Best Bets and Keywords, will help improve search relevancy.  However, if your organization has a large percentage of similar 25+ page documents, you may need to consider a divide-and-conquer approach to enhance search results.
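
    One way to sample documents for noiseword candidates is to count which words show up in nearly every document.  The sketch below works on plain-text strings you have already extracted; the function name, the 80% document-share cutoff, and the simple word pattern are all assumptions for illustration, and the output is a list to review by hand, not something to paste into the noiseword file blindly.

```python
import re
from collections import Counter


def candidate_noise_words(documents, top_n=10, min_doc_share=0.8):
    """Return (word, total_count) pairs for words found in most of the sample.

    A word qualifies only if it appears in at least min_doc_share of the
    documents; qualifying words are listed most frequent first.
    """
    total = Counter()     # occurrences across the whole sample
    doc_freq = Counter()  # number of documents containing each word
    for text in documents:
        words = re.findall(r"[a-z']+", text.lower())
        total.update(words)
        doc_freq.update(set(words))
    threshold = min_doc_share * len(documents)
    common = [(w, c) for w, c in total.most_common() if doc_freq[w] >= threshold]
    return common[:top_n]


if __name__ == "__main__":
    sample = [
        "widget alpha widget",
        "widget beta",
        "widget gamma gamma gamma",
    ]
    print(candidate_noise_words(sample, top_n=3))
```

    Requiring a word to appear in most documents keeps a term that is merely repeated inside one long document off the list.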

    In SharePoint, documents can be "assembled" from various parts using a Content Query Web Part to display the chapters of an Employee Handbook, for instance.  If users need to search that entire document offline, you may need to create an Office script to pull the chapters from SharePoint into a single .docx file (although splitting the document back up could prove problematic).

    Comments (3)


    Paul wrote:
    I just saw this blog about the AutoSummarize feature in Word 2007.  This is a great idea to help users find the right document with MOSS Search.
    Apr. 2
    Paul wrote:
    Hi James,
     
    The reason I put "assembled" in quotes like that is because the CQWP just aggregates the Content Type "Employee Handbook" from various libraries in the site collection and displays them as a list of related documents.  In that situation, you can use either metadata or the title ("01-Ethics, 02-Reporting...") to sort the chapters.
     
    I would need to research construction of a document for offline viewing or printing before giving a recommendation, but my first approach would be to create the full document as a PDF by running a conversion workflow.  The Employee Handbook chapter Content Type would have a workflow that triggers file conversion any time an individual document publishes a major version.  If a third-party tool already exists for that, it would not surprise me.  Other techniques could leverage the Word DOM and VBA or custom web controls.
     
    More food for thought:  another knock against MOSS, relating to documents and search, surrounds the user experience of having to open the target document in the client application and run Find (CTRL+F), rinse and repeat until they find what they need.  By chunking documents, the user gets better search results and has less to download/open each time they select a document. 
     
    On the client side, Word has less to save during autosave--something users usually turn off while working with large documents.  Also, more individual contributors can work on the same document as parts when normally only one could have it checked out.
     
    While chunking large docs is by no means a panacea, the performance gains may ease user frustration.
    Apr. 1
    James wrote:
    Hi. Spookily enough, I happen to be about to start an Employee Handbook SharePoint project and read your last paragraph with particular interest. I don't suppose you could elaborate on how exactly you "assemble" documents using the CQWP? Are you proposing the use of "Chapter" and perhaps "Policy Number" Site Columns which are then used in the scripted generation of the full handbook for full offline viewing? If I'm interpreting your thoughts correctly, an HR admin would upload an individual policy into a library, specifying which Chapter it belongs to and its Policy number. Then users can either view each policy as a separate document or download the dynamically generated full document. Is that the type of thing you are talking about?
     
    Do you have any sample code for this document generation, or is it fairly standard SharePoint stuff? I'm fairly new to SharePoint development so I hope I'm making sense.
    Apr. 1
