

6·
15 days agoIt’s a curse because it’s used for things other than what it’s intended to. It’s doing a good job representing printed material, but unfortunately people very commonly expect it to be something more akin to a word processor file.
It’s a curse because it’s used for things other than what it’s intended to. It’s doing a good job representing printed material, but unfortunately people very commonly expect it to be something more akin to a word processor file.
I know the pain. While there are definitely solutions that work sometimes, there’s just no “one size fits all” that I’m aware of. PDFs can represent text very differently internally.
What I did for one project where extracting the text produced a complete mess was to convert the PDF pages to images and then OCR them…
Hate? Digital decluttering feels really good, for me anyway.
Interesting, I’ll keep it in mind next time I have to deal with this problem (hopefully never but who knows).
A few years ago I was in contact with researchers that were developing an AI tool to parse PDFs (I think they didn’t care about converting to editable formats, but extracting data), from their material I got the impression that it’s extremely difficult to do right using traditional algorithms.