[Dropped] Convert pixel positions to cell references

Post by **sokolowitzky** » Sun Jan 01, 2023 12:42 pm

I am trying to convert some PDF pages (released for public use) containing articles/texts into Calc sheets.
I use OO Draw to open the file and Basic to fetch the data from the index of each page.

Due to the nature of PDF files, the text is split into tiny parts, and their positions on the PDF page are given as X/Y positions.
These elements do not necessarily follow any predictable order in the index of the PDF page.
Some elements that are shown on the first rows of the PDF page appear as the last elements of the index.

To position each PDF element accurately in a Calc sheet, I tried dividing its vertical position by the height of the original PDF page, which is 30500 pixels in this particular case, and its horizontal position by the page width, which is 21000 pixels.
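The division described above can be sketched roughly as follows. This is an illustration in Python rather than the Basic macro from the post, which was not shared. The page size (21000 x 30500) comes from the post; the grid size (`n_cols`, `n_rows`) and the helper names are assumptions made up for the example.

```python
PAGE_WIDTH = 21000   # page width as stated in the post
PAGE_HEIGHT = 30500  # page height as stated in the post


def column_name(index):
    """Convert a 0-based column index to a spreadsheet column name (0 -> A, 26 -> AA)."""
    name = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        name = chr(ord("A") + rem) + name
    return name


def position_to_cell(x, y, n_cols=10, n_rows=60):
    """Map a page position to a cell address on a hypothetical n_cols x n_rows grid."""
    # Scale each coordinate to the grid, then clamp so the page edge stays inside it.
    col = min(int(x / PAGE_WIDTH * n_cols), n_cols - 1)
    row = min(int(y / PAGE_HEIGHT * n_rows), n_rows - 1)
    return f"{column_name(col)}{row + 1}"


# The two vertical positions from the post land in the same row with this grid,
# but two values on either side of a row boundary would not.
print(position_to_cell(2999, 6972))
print(position_to_cell(2999, 6939))
```

A fixed grid like this is sensitive to where the row boundaries happen to fall, which is one reason small differences in vertical position can still separate fragments of the same line.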

It doesn't matter if they are not perfectly in order when I transfer the data into the Calc sheet. For example, between some full rows there are blank rows that aren't in the original document. Since I can merge adjacent filled cells into a single cell with another macro, I can fix such problems easily.

My problem is that some elements' vertical positions in the page index are slightly smaller than those of the elements before them, so when I transfer the text into the Calc sheet I lose the original word order. The differences are small enough to ignore individually, but with hundreds or thousands of pages I can't come up with an algorithm that handles them reliably.

Attachment: expl.PNG (2.98 KiB)

Text: "(IV)" Vertical position: 6972
Text: "Рынки состояли из торговых рядов, которые в" Vertical position: 6939

I'm trying to find a way to preserve the word order.
Maybe there are other useful tools that I am not aware of.
Is there an alternative approach to converting positions into cells?
You can see an example in the attached document, where I give the horizontal and vertical ranks of each text fragment in a Calc sheet.

Attachment: Pixel to cell conversion.ods (70.98 KiB)

If I can find a solution to this, I believe 2023 will be my lucky year.

Post by **ms777** » Mon Jan 02, 2023 9:25 am

Hi,
Parsing a PDF into a meaningful document format is highly nontrivial. I have had very good experiences with the Extract Text function of PDFBOX https://pdfbox.apache.org/ and with iText7 https://itextpdf.com/products/itext-7 (license required for commercial applications). The former is Java, so in principle it should be possible to build an LO extension from it.
Good luck ms777

Post by **sokolowitzky** » Tue Jan 03, 2023 8:48 am

Hello. Thank you for your reply. I'd prefer to do the conversion in OpenOffice so that I can keep the PDF files without converting them and easily feed the results into my own sheets.
I just need a small push to complete my own macro at this point.
I've looked for a cursor-like solution that tracks each element on a PDF page from top left to bottom right,
so that, using the positions given in pixels, I can place any element on the PDF page accurately.
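One possible reading of "top left to bottom right" that tolerates small vertical differences is to group fragments into lines first and only then sort each line by horizontal position. The sketch below is in Python, not the Basic macro from the thread. The function name, the tolerance of 100 units, and all X values are assumptions for illustration; only the two vertical positions (6972 and 6939) come from the first post.

```python
def group_into_lines(fragments, tolerance=100):
    """Group (text, x, y) fragments into lines, then sort each line left to right.

    A fragment joins the current line when its y is within `tolerance`
    of the first (topmost) fragment of that line; otherwise it starts a new line.
    """
    lines = []
    for frag in sorted(fragments, key=lambda f: f[2]):
        if lines and frag[2] - lines[-1][0][2] <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within a line, reading order is decided by x alone.
    return [sorted(line, key=lambda f: f[1]) for line in lines]


# Hypothetical x values; y values for the first two fragments are from the post.
fragments = [
    ("Рынки состояли из торговых рядов, которые в", 2500, 6939),
    ("(IV)", 1500, 6972),
    ("next line", 1500, 7500),
]

for row, line in enumerate(group_into_lines(fragments), start=1):
    print(row, " ".join(text for text, _, _ in line))
```

With these assumed values, "(IV)" ends up before the Russian text on the same row even though its vertical position is slightly larger. A suitable tolerance would probably have to be derived from the font size or line spacing of the actual pages.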

Post by **John_Ha** » Tue Jan 03, 2023 12:24 pm

ms777 wrote: ↑Mon Jan 02, 2023 9:25 am Parsing a PDF into a meaningful document format is highly nontrivial.

+1. I'd say virtually impossible.

sokolowitzky wrote: ↑Sun Jan 01, 2023 12:42 pm Due to the nature of PDF files, the text is split into tiny parts, and their positions on the PDF page are given as X/Y positions.
These elements do not necessarily follow any predictable order in the index of the PDF page.
Some elements that are shown on the first rows of the PDF page appear as the last elements of the index.

I am not surprised. A PDF is designed to display its contents identically on multiple platforms. It is not designed to be edited by anything except Adobe Acrobat. Some of the necessary tools to edit a PDF are proprietary to Adobe and not available. See [Tutorial] How do I view or edit a PDF file with OpenOffice?

Please upload a small pdf file you want to parse so that it can be analysed for alternative solutions. Explain what you want the .ods file to look like.

Press POSTREPLY and click the Upload attachment tab below where you type (128 kB max); or use a file share site such as mediafire, Dropbox or Google Drive for a larger file.

Post by **JeJe** » Tue Jan 03, 2023 1:08 pm

You've provided a document with no code in it and a sheet called "Positioning shown on PDF" where, for example, with the view cursor set at E44 ("бывает туман."), the position is 2999,32235 (not "pixels" but 1/100 mm). There is also a "raw data" sheet with completely different numbers for that location: 2161, 16384.
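To make the unit point concrete, here is a small Python sketch of the conversion, assuming the positions really are in 1/100 mm. The function names and the 96 DPI default are assumptions for illustration only.

```python
HUNDREDTHS_MM_PER_INCH = 2540  # 1 inch = 25.4 mm = 2540 hundredths of a mm


def hundredths_mm_to_mm(value):
    """Convert a length in 1/100 mm to millimetres."""
    return value / 100


def hundredths_mm_to_pixels(value, dpi=96):
    """Convert a length in 1/100 mm to pixels at an assumed screen resolution."""
    return value / HUNDREDTHS_MM_PER_INCH * dpi


# Read as 1/100 mm, the page size from the first post is 210 mm x 305 mm.
print(hundredths_mm_to_mm(21000), hundredths_mm_to_mm(30500))
print(round(hundredths_mm_to_pixels(21000)))
```

Read this way, the 21000 x 30500 page would be 210 mm wide, the same width as A4, which fits the 1/100 mm interpretation better than pixels.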

If you explain a bit more what your process is, from the start, it might help people help you more.
