pdfminer.six 20240706

PDF parser and analyzer

01-20-2022

Stars: 5877, Watchers: 5877, Forks: 928, Open Issues: 245

The pdfminer/pdfminer.six repo was created 10 years ago and the last code push was 2 months ago.
The project is very popular, with 5877 GitHub stars.

How to Install pdfminer.six

You can install pdfminer.six using pip:

pip install pdfminer.six

Or add it to a project with Poetry:

poetry add pdfminer.six
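
After either command, one way to confirm the install is to ask Python's standard `importlib.metadata` module for the distribution's version. This is a minimal sketch; the helper name `installed_version` is made up for illustration and is not part of pdfminer.six.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="pdfminer.six"):
    """Return the installed version string, or None if the distribution is absent.

    `installed_version` is a hypothetical helper, not a pdfminer.six API.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found:
        print(f"pdfminer.six version: {found}")
    else:
        print("pdfminer.six is not installed in this environment")
```

The function returns `None` instead of raising, so it can be used in a setup script that decides whether to run the install step.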

Package Details

Author: Yusuke Shinyama + Philippe Guglielmetti
License: MIT
Homepage: https://github.com/pdfminer/pdfminer.six
PyPI: https://pypi.org/project/pdfminer.six/
GitHub Repo: https://github.com/pdfminer/pdfminer.six

Classifiers

Text Processing

Errors

A list of common pdfminer.six errors.

Code Examples

Here are some pdfminer.six code examples and snippets.
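
As a starting point, the sketch below pulls plain text out of a PDF with `extract_text` from `pdfminer.high_level`. The wrapper `pdf_to_text` and the idea of passing the file path on the command line are assumptions made for illustration; only `extract_text` and its `page_numbers` parameter come from the library.

```python
from pathlib import Path


def pdf_to_text(pdf_path, page_numbers=None):
    """Return the text of a PDF as one string.

    `pdf_to_text` is a hypothetical wrapper, not part of pdfminer.six.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported here so the path check above works even if the
    # package has not been installed yet.
    from pdfminer.high_level import extract_text

    # page_numbers is a collection of zero-indexed pages; None means all pages.
    return extract_text(str(path), page_numbers=page_numbers)


if __name__ == "__main__":
    import sys

    # Usage (assumed): python this_script.py path/to/file.pdf
    if len(sys.argv) > 1:
        print(pdf_to_text(sys.argv[1]))
```

Calling `pdf_to_text(path, page_numbers=[0])` would limit extraction to the first page, which is a quick way to check how a document's layout comes out before processing the whole file.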

GitHub Issues

The pdfminer.six package has 245 open issues on GitHub, including:

Add extras_require in setup.py for PIL, and raise error if not installed when needing PIL
encodingdb.name2unicode(name: str) -> str can't handle type1 font diff like: 2, /'MT110', /'MT50',…
reading order is not quite right for multiple columns in one page
Same sentence is printed three times for a specific PDF file when using pdf2txt
extract images including their textual Figure number/title etc located below the image in a pdf. ie a margin around the image to be captured as well.
extract heading and section headers from pdf…can't achieve this now
Prefer logging to warning
Fix regression in page layout that sometimes returned text lines out of order
Text out of order with pdfminer 20201018
getting lots of (cid:#) instead of readable text
Question: Negative bbox coordinate (x1)
split a multi-page pdf file into multiple pdf files
list index out of range at self.cmap.add_cid2unichr(s1+i, code[i])