[Dropped] Convert pixel positions to cell references

Creating a macro - Writing a Script - Using the API (OpenOffice Basic, Python, BeanShell, JavaScript)
sokolowitzky
Posts: 103
Joined: Mon Sep 15, 2014 7:34 pm

[Dropped] Convert pixel positions to cell references

Post by sokolowitzky »

I am trying to convert some PDF pages (released for public use) containing articles and text into Calc sheets.
I open the file in OpenOffice Draw and use Basic to fetch the data from each page's index of elements.

Due to the nature of PDF files, the text is split into tiny fragments, and each fragment's position on the page is given as X/Y coordinates.
These elements do not necessarily follow any particular order in the page's index.
Some elements that appear in the first rows of the page are listed as the last elements of the index.

To position each element from the PDF accurately in a Calc sheet, I tried dividing its vertical position by the height of the original PDF page, which is 30500 pixels in this case, and its horizontal position by the width of the page, which is 21000 pixels.
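
A minimal sketch of the scaling described above, written in Python for illustration. The page size of 21000 × 30500 units comes from the post; the grid size (`n_cols`, `n_rows`) and the function name are hypothetical and would need tuning for the real pages.

```python
def position_to_cell(x, y, page_width=21000, page_height=30500,
                     n_cols=20, n_rows=60):
    """Map a page position to a zero-based (row, col) grid index.

    n_cols and n_rows are illustrative assumptions, not values from
    the original post.
    """
    # Fraction of the page covered, scaled to the grid size.
    col = int(x / page_width * n_cols)
    row = int(y / page_height * n_rows)
    # Clamp so a position exactly on the far edge stays inside the grid.
    return min(row, n_rows - 1), min(col, n_cols - 1)
```

Fixed bins like these can still place two fragments of the same text line in different rows when their positions fall on either side of a bin boundary.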

I don't mind if the data does not look tidy when I transfer it into the Calc sheet; for example, between some full rows there are blank rows that aren't in the original document. Since I can merge neighboring filled cells into a single cell with another macro, I can fix such problems easily.

My problem is that some elements' vertical positions in the page index are smaller than those of earlier elements, so when I transfer the text into the Calc sheet I lose the original word order. Even though the differences are small enough to ignore, there are hundreds or thousands of pages, and I can't come up with an accurate algorithm.
Attachment: expl.PNG (2.98 KiB)
Text: "(IV)" Vertical position: 6972
Text: "Рынки состояли из торговых рядов, которые в" Vertical position: 6939
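
One possible way to keep word order when vertical positions differ slightly, sketched in Python: treat fragments whose vertical positions lie within a tolerance as one text line, then sort each line left to right. The vertical positions 6972 and 6939 are from the post; the horizontal positions, the third fragment, the `tolerance` value, and the function name are made up for illustration.

```python
def reading_order(fragments, tolerance=50):
    """Group (text, x, y) fragments into lines, each sorted left to right.

    A fragment joins the current line if its y is within `tolerance`
    of the topmost fragment of that line. The tolerance is an
    assumption to be tuned per document.
    """
    lines = []
    for frag in sorted(fragments, key=lambda f: f[2]):
        # Compare against the first (topmost) fragment of the current line.
        if lines and frag[2] - lines[-1][0][2] <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments by horizontal position.
    return [sorted(line, key=lambda f: f[1]) for line in lines]


# x values and the third fragment are hypothetical example data.
fragments = [
    ("Рынки состояли из торговых рядов, которые в", 2200, 6939),
    ("(IV)", 1500, 6972),
    ("next line", 1500, 7400),
]
for line in reading_order(fragments):
    print(" ".join(text for text, x, y in line))
```

With these sample values the first two fragments end up on the same line, "(IV)" first, even though their vertical positions differ by 33 units. The tolerance should stay below the line spacing of the page, otherwise separate lines are merged.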

I'm trying to find a way to ensure that the word order is kept.
Maybe there are other useful tools that I am not aware of.
Is there an alternative approach to converting positions into cells?
You can see an example in the attached document, where I list the horizontal and vertical rank of each text fragment in a Calc sheet.
Attachment: Pixel to cell conversion.ods (70.98 KiB)
If I can find a solution to this, I believe 2023 will be my lucky year :)
Last edited by MrProgrammer on Sun Mar 26, 2023 5:47 pm, edited 1 time in total.
Reason: Dropped: No attachment provided when requested -- MrProgrammer, forum moderator
Win10-OpenOffice 4.1/LibreOffice 7.4
ms777
Volunteer
Posts: 177
Joined: Mon Oct 08, 2007 1:33 am

Re: Conversion Of Positions Given as Pixel Into Cells

Post by ms777 »

Hi,
parsing PDF into a meaningful document format is highly nontrivial. I have had very good experiences with the Extract Text function of PDFBox https://pdfbox.apache.org/ and with iText 7 https://itextpdf.com/products/itext-7 (license required for commercial use). The former is Java, so in principle it should be possible to build an LO extension from it.
Good luck, ms777
sokolowitzky
Posts: 103
Joined: Mon Sep 15, 2014 7:34 pm

Re: Conversion Of Positions Given as Pixel Into Cells

Post by sokolowitzky »

Hello. Thank you for your reply. I'd prefer to do the conversion in OpenOffice, so that I can keep the PDF files without converting them and easily feed the data into my own sheets.
At this point I need only a small push to complete my own macro.
I've been looking for a cursor-like solution that tracks each element on a PDF page from top left to bottom right,
so that, using the positions given in pixels, I can determine the accurate position of any element on the page.
Win10-OpenOffice 4.1/LibreOffice 7.4
John_Ha
Volunteer
Posts: 9584
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

Re: Conversion Of Positions Given as Pixel Into Cells

Post by John_Ha »

ms777 wrote: Mon Jan 02, 2023 9:25 am parsing PDF into a meaningful document format is highly nontrivial.
+1. I'd say virtually impossible.
sokolowitzky wrote: Sun Jan 01, 2023 12:42 pm Due to the nature of PDF files, the text is split into tiny fragments, and each fragment's position on the page is given as X/Y coordinates.
These elements do not necessarily follow any particular order in the page's index.
Some elements that appear in the first rows of the page are listed as the last elements of the index.
I am not surprised. A PDF is designed to display its contents identically on multiple platforms. It is not designed to be edited by anything except Adobe Acrobat. Some of the necessary tools to edit a PDF are proprietary to Adobe and not available. See [Tutorial] How do I view or edit a PDF file with OpenOffice?

Please upload a small pdf file you want to parse so that it can be analysed for alternative solutions. Explain what you want the .ods file to look like.

Press POSTREPLY and click the Upload attachment tab below where you type (128 kB max); or use a file share site such as mediafire, Dropbox or Google Drive for a larger file.
LO 6.4.4.2, Windows 10 Home 64 bit

See the Writer Guide, the Writer FAQ, the Writer Tutorials and Writer for students.

Remember: Always save your Writer files as .odt files. - see here for the many reasons why.
JeJe
Volunteer
Posts: 2787
Joined: Wed Mar 09, 2016 2:40 pm

Re: Conversion Of Positions Given as Pixel Into Cells

Post by JeJe »

You've provided a document with no code in it and a sheet called "Positioning shown on PDF" where, for example, the view cursor set at E44 ("бывает туман.") has a position of 2999, 32235 (not "pixels" but 1/100 mm), and a "raw data" sheet with completely different numbers for that location: 2161, 16384.
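
For reference, a small Python sketch of converting lengths given in 1/100 mm into pixels or PDF points. The default of 96 dpi is an assumption, and the function names are illustrative only.

```python
MM_PER_INCH = 25.4


def hundredths_mm_to_pixels(value, dpi=96):
    """Convert a length in 1/100 mm to pixels at the given resolution."""
    return value / 100 / MM_PER_INCH * dpi


def hundredths_mm_to_points(value):
    """Convert a length in 1/100 mm to PDF points (1/72 inch)."""
    return value / 100 / MM_PER_INCH * 72
```

Read this way, the page width of 21000 quoted earlier in the thread corresponds to 210 mm.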

If you explain a bit more what your process is, from the start, it might help people help you more.
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)