While textual near duplicate identification is simple to understand, the implementation is complex and relies on several optimizations so that results can be delivered in a reasonable amount of time. The following is a simplified explanation of this process:

It takes the contents of all documents with 30 MB or less of text in the field you choose to analyze. This defaults to the Extracted Text field, but you can change it under Select field to analyze when setting up the structured analytics set.

It scans the text and saves various statistics for later use. The task operates on text only (which has been converted to lowercase). White space and punctuation characters are also ignored, except to identify word and sentence boundaries.

The documents are sorted by the amount of text in the field being analyzed, in order from largest to smallest. This is also the order in which they are processed. The most visible optimization and organizing notion is the principal document. The principal document is the largest document in a group and is the document that all others are compared to when determining whether they are near duplicates. If the current document is a close enough match to the principal document, as defined by the Minimum Similarity Percentage, it is placed in that group. If no current groups are matches, the current document becomes a new principal document.

Note: Analyzed documents that are not textually similar enough to any other documents will not have fields populated for Textual Near Duplicate Principal or Textual Near Duplicate Group. Documents that only contain numbers or that do not contain text will have the Textual Near Duplicate Group field set to numbers-only or empty, respectively.

When the process is complete, only principal documents that have one or more near duplicates are shown in groups. Documents that have the Textual Near Duplicate Group field set to empty or numbers-only are also grouped together. Documents that are not textually similar to any other documents in your analysis group, based on the minimum similarity percentage chosen, end up as "standalone" documents that do not belong to a near duplicate group.

The Minimum Similarity Percentage parameter controls how the task works. This parameter indicates how similar a document must be to a principal document to be placed into that principal's group. A value of 100% would indicate an exact textual duplicate. A higher setting requires more similarity and generally results in smaller groups. A higher setting also makes the process run faster because fewer comparisons have to be made.
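The grouping process described here can be illustrated with a small Python sketch. It is not the product's implementation: the real similarity measure and its optimizations are not described in the text, so `difflib.SequenceMatcher` over word tokens stands in for them. The function names, the reading of the 30 MB limit as UTF-8 bytes, and the choice to place a document in the first principal it matches are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumption: the 30 MB limit is read here as bytes of UTF-8 text.
MAX_TEXT_BYTES = 30 * 1024 * 1024


def normalize(text):
    """Lowercase the text and keep only word tokens, so white space and
    punctuation serve purely as word boundaries."""
    return re.findall(r"\w+", text.lower())


def similarity(a, b):
    """Stand-in similarity score (0-100) between two token lists.
    The actual measure used by the product is not given in the text."""
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def group_near_duplicates(docs, min_similarity=90.0):
    """docs maps a document id to its text.

    Returns (groups, standalone, special):
      groups     - principal id -> [principal, near duplicates...]
      standalone - ids that matched no other document
      special    - ids bucketed as "empty" or "numbers-only"
    """
    tokens = {}
    special = {"empty": [], "numbers-only": []}
    for doc_id, text in docs.items():
        if len(text.encode("utf-8")) > MAX_TEXT_BYTES:
            continue  # over the size limit, so not analyzed
        words = normalize(text)
        if not words:
            special["empty"].append(doc_id)
        elif all(w.isdigit() for w in words):
            special["numbers-only"].append(doc_id)
        else:
            tokens[doc_id] = words

    # Process documents from largest to smallest amount of text.
    order = sorted(tokens, key=lambda d: len(docs[d]), reverse=True)

    candidates = {}  # principal id -> members, principal first
    for doc_id in order:
        for principal in candidates:
            if similarity(tokens[principal], tokens[doc_id]) >= min_similarity:
                # Assumption: the first matching principal wins.
                candidates[principal].append(doc_id)
                break
        else:
            # No existing group matched: this document becomes a principal.
            candidates[doc_id] = [doc_id]

    groups = {p: m for p, m in candidates.items() if len(m) > 1}
    standalone = [p for p, m in candidates.items() if len(m) == 1]
    return groups, standalone, special
```

Because documents are processed largest first, every principal is at least as large as the documents later compared against it, which matches the description of the principal as the largest document in its group. Raising `min_similarity` in this sketch only tightens the grouping. It does not reproduce the speedup mentioned in the text, which comes from optimizations the text does not detail.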