In this new release, we have also introduced several improvements in TextAbsorber and in the internal text formatting mechanism. During text extraction in 'Pure' mode, you may now specify the ScaleFactor option, which offers another approach to extracting text from a multi-column PDF document besides the approach stated above. This scale factor adjusts the grid used by the internal text formatting mechanism during text extraction.

If the ScaleFactor value is not specified, the default value of 1.0 is used. Values from 0.1 (inclusive) up to 1 have the same effect as font reduction. Values between -0.1 and 0.1 are treated as zero, which makes the algorithm calculate the scale factor needed for text extraction automatically. If the specified ScaleFactor value is greater than 10 or less than -0.1, the default value of 1.0 is used. The automatic calculation is based on the average glyph width of the most popular font on the page, but we cannot guarantee that no string in one column of the extracted text reaches the start of the next column.

We recommend auto-scaling (ScaleFactor = 0) when processing a large number of PDF files for text content extraction, or manually setting a redundant reduction of grid width (about ScaleFactor = 0.5). You do not have to determine whether scaling is necessary for each particular document: if you set a redundant reduction of grid width for a document that does not need it, the extracted text content will remain fully adequate.

To extract text from all the pages of a PDF document using Aspose.PDF Java for Python, simply invoke the ExtractTextFromAllPages module. To start extracting text, first call the ExtractText method; this extracts the text from the PDF file and stores the result. Please take a look at the following code snippet.
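As an aside, the ScaleFactor value rules described above can be sketched in plain Python. This does not call Aspose.PDF at all: the function name normalize_scale_factor and the constants are hypothetical, and the handling of the exact boundary values (-0.1 counted as auto, 0.1 counted as reduction) is our reading of the text rather than confirmed library behaviour.

```python
# Illustration only: a plain-Python restatement of the ScaleFactor rules
# from the text. None of these names belong to the Aspose.PDF API.

DEFAULT_SCALE = 1.0  # used when no value, or an out-of-range value, is given
AUTO_SCALE = 0.0     # zero asks the algorithm to compute the factor itself


def normalize_scale_factor(value=None):
    """Return the scale factor that would take effect for a requested value."""
    if value is None:
        # Not specified: the default of 1.0 applies.
        return DEFAULT_SCALE
    if value > 10 or value < -0.1:
        # Out of range: fall back to the default of 1.0.
        return DEFAULT_SCALE
    if value < 0.1:
        # From -0.1 up to (but not including) 0.1: treated as zero (auto).
        return AUTO_SCALE
    # From 0.1 to 10: used as given; values below 1 reduce the grid width.
    return float(value)


if __name__ == "__main__":
    for requested in (None, 0, 0.05, 0.1, 0.5, 1, 10, 11, -0.2):
        print(requested, "->", normalize_scale_factor(requested))
```

Running the script prints the effective value for a handful of requested values, which makes it easy to check where a given setting falls before passing it to the real extraction options.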
The namespace provides classes that allow you to extract text, add text, and manipulate existing text in a document.

Second approach - Using ScaleFactor

The example demonstrates how to extract text from the first page of a PDF document.

public static void ExtractFromAllPages()
// open document
Document doc = new Document(inFile);
// create TextAbsorber