I am trying to convert some PDF pages (released for public use) containing articles/texts into Calc sheets.
I use OO Draw to open the file and Basic to fetch the data from the index of each page.
Due to the nature of PDF files, the text is split into tiny parts, and their positions on the PDF page are given as X/Y positions.
These elements do not necessarily follow any particular order in the index of the PDF page.
Some elements that appear in the first rows of the PDF page are given as the last elements of the index.
To position each element from the PDF accurately in a Calc sheet, I tried dividing its vertical position by the height of the original PDF page, which is 30500 pixels in this particular case, and its horizontal position by the width of the page, which is 21000 pixels.
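As a rough illustration of the division described above, here is a minimal sketch in Python (not the Basic macro itself). The grid size (`n_rows`, `n_cols`) and the name `position_to_cell` are assumptions made up for the example; only the page dimensions 21000 and 30500 come from this post.

```python
PAGE_WIDTH = 21000   # page width given in the post
PAGE_HEIGHT = 30500  # page height given in the post


def position_to_cell(x, y, n_rows=60, n_cols=20):
    """Map a page position to a zero-based (row, col) on an assumed grid.

    n_rows and n_cols are hypothetical; choose them to match the sheet layout.
    """
    # Integer arithmetic avoids floating-point rounding at cell boundaries.
    row = min(y * n_rows // PAGE_HEIGHT, n_rows - 1)
    col = min(x * n_cols // PAGE_WIDTH, n_cols - 1)
    return row, col


# Hypothetical horizontal position 3000 combined with a vertical position of 6939.
print(position_to_cell(3000, 6939))
```

With this assumed grid, vertical positions 6939 and 6972 fall into the same row, but two values that straddle a row boundary would be split across rows, which is the weak point of plain division.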
I don't mind if the rows don't look perfectly in order when I transfer the data into the Calc sheet. For example, between some full rows there are blank rows that aren't in the original document. Since I can merge the contents of neighboring filled cells into a single cell with another macro, I can fix such problems easily.
My problem is this: since the vertical position of some elements in the index of a PDF page is given as less than that of earlier ones, I lose the original word order when I transfer the text into the Calc sheet. Even though the differences are small enough to be negligible, there are hundreds or thousands of pages, and I can't manage to devise an accurate algorithm.
Text: "(IV)" Vertical position: 6972
Text: "Рынки состояли из торговых рядов, которые в" Vertical Position: 6939
I'm trying to figure out a way to ensure that the word order is kept.
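One possible approach, sketched in Python rather than Basic: sort the fragments by vertical position, treat fragments whose vertical positions lie within a tolerance of the first fragment of a line as belonging to that line, then sort each line by horizontal position. The tolerance of 100, the horizontal positions, and the name `group_into_lines` are assumptions for illustration; only the two texts and their vertical positions come from the example above.

```python
def group_into_lines(fragments, tolerance=100):
    """Group (text, x, y) fragments into lines and return texts in reading order.

    tolerance is an assumed maximum vertical distance from the first
    fragment of a line; tune it to the line spacing of the document.
    """
    lines = []
    for text, x, y in sorted(fragments, key=lambda f: f[2]):
        if lines and y - lines[-1]["y"] <= tolerance:
            # Close enough to the current line's anchor: same line.
            lines[-1]["items"].append((x, text))
        else:
            # Too far below the anchor: start a new line.
            lines.append({"y": y, "items": [(x, text)]})
    # Within each line, order the fragments from left to right.
    return [[text for _, text in sorted(line["items"])] for line in lines]


# Vertical positions are from the post; horizontal positions are made up.
fragments = [
    ("(IV)", 2000, 6972),
    ("Рынки состояли из торговых рядов, которые в", 3000, 6939),
]
for line in group_into_lines(fragments):
    print(" ".join(line))
```

Comparing against the first fragment of each line, rather than the previous fragment, keeps a long chain of small offsets from merging two separate lines.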
Maybe there are other useful tools that I am not aware of.
Is there any alternative approach to convert positions into cells?
You can see an example in the attached document. I gave the horizontal and vertical ranks of each tiny part of the text in a Calc sheet.
If I can find a solution to this, I believe 2023 will be my lucky year.
[Dropped] Convert pixel positions to cell references
- Posts: 103
- Joined: Mon Sep 15, 2014 7:34 pm
Last edited by MrProgrammer on Sun Mar 26, 2023 5:47 pm, edited 1 time in total.
Reason: Dropped: No attachment provided when requested -- MrProgrammer, forum moderator
Win10-OpenOffice 4.1/LibreOffice 7.4
Re: Conversion Of Positions Given as Pixel Into Cells
Hi,
Parsing a PDF into a meaningful document format is highly nontrivial. I have had very good experiences with the Extract Text function of PDFBox https://pdfbox.apache.org/ and iText 7 https://itextpdf.com/products/itext-7 (license required for commercial applications). The former is Java, so it should be possible in principle to build an LO extension from it.
Good luck, ms777
Re: Conversion Of Positions Given as Pixel Into Cells
Hello. Thank you for your reply. I'd prefer to do the conversion in OpenOffice so that I can keep the PDF files without converting them and easily feed the data into my own sheets.
I just need a small nudge to complete my own macro at this point.
I've been looking for a cursor-like solution that tracks each element on a PDF page from top left to bottom right,
so that, using the positions given in pixels, I can accurately position any element on the PDF page.
Win10-OpenOffice 4.1/LibreOffice 7.4
Re: Conversion Of Positions Given as Pixel Into Cells
+1. I'd say virtually impossible.
sokolowitzky wrote: Sun Jan 01, 2023 12:42 pm
Due to the nature of pdf files, texts are splitted into tiny parts and their positions on the pdf page is given as X/Y positions.
And these elements of pdf page do not necessarily follow any possible order in the index of pdf page.
Some elements that shown on first rows of the pdf page, are given as last elements of index.
I am not surprised. A PDF is designed to display its contents identically on multiple platforms. It is not designed to be edited by anything except Adobe Acrobat. Some of the necessary tools to edit a PDF are proprietary to Adobe and not available. See [Tutorial] How do I view or edit a PDF file with OpenOffice?
Please upload a small pdf file you want to parse so that it can be analysed for alternative solutions. Explain what you want the .ods file to look like.
Press POSTREPLY and click the Upload attachment tab below where you type (128 kB max); or use a file share site such as mediafire, Dropbox or Google Drive for a larger file.
LO 6.4.4.2, Windows 10 Home 64 bit
See the Writer Guide, the Writer FAQ, the Writer Tutorials and Writer for students.
Remember: Always save your Writer files as .odt files. - see here for the many reasons why.
Re: Conversion Of Positions Given as Pixel Into Cells
You've provided a document with no code in it and a sheet called "Positioning shown on PDF" where, for example, the view cursor set at E44 ("бывает туман.") has a position of 2999, 32235 (not "pixels" but 100ths of a mm), and a "raw data" sheet with completely different numbers for that location: 2161, 16384.
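To make the unit point concrete, here is a small Python sketch of the conversion from 100ths of a millimetre. The function names and the 96 dpi default are assumptions for illustration. Read this way, the page width of 21000 from the first post is 210 mm, which is the width of an A4 page.

```python
MM_PER_INCH = 25.4


def hundredths_mm_to_mm(value):
    """Convert a position in 1/100 mm to millimetres."""
    return value / 100.0


def hundredths_mm_to_pixels(value, dpi=96):
    """Convert 1/100 mm to pixels at an assumed resolution (dpi is hypothetical)."""
    return hundredths_mm_to_mm(value) / MM_PER_INCH * dpi


print(hundredths_mm_to_mm(21000))    # page width from the first post, in mm
print(hundredths_mm_to_pixels(2999)) # horizontal position quoted above, at 96 dpi
```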
If you explain a bit more what your process is, from the start, it might help people help you more.
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)