pdfminer.six 20240706

PDF parser and analyzer

Stars: 5877, Watchers: 5877, Forks: 928, Open Issues: 245

The pdfminer/pdfminer.six repo was created 10 years ago and the last code push was 2 months ago.
The project is popular, with 5877 GitHub stars.

How to Install pdfminer.six

You can install pdfminer.six using pip:

pip install pdfminer.six

Or add it to a project with Poetry:

poetry add pdfminer.six
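
After installing, it can be handy to confirm which version ended up in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and it assumes the distribution is registered under the name "pdfminer.six" as on PyPI.

```python
from importlib import metadata


def installed_version(distribution="pdfminer.six"):
    """Return the installed version string of a distribution, or None if absent.

    `installed_version` is an illustrative helper, not part of pdfminer.six.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


# Example usage:
#     print(installed_version())
```

A return value of None suggests the install went to a different interpreter or virtual environment than the one running the check.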

Package Details

Author: Yusuke Shinyama + Philippe Guglielmetti
License: MIT
Homepage: https://github.com/pdfminer/pdfminer.six
PyPI: https://pypi.org/project/pdfminer.six/
GitHub Repo: https://github.com/pdfminer/pdfminer.six

Classifiers

  • Text Processing

Code Examples

Here are some pdfminer.six code examples and snippets.
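
As a starting point, here is a minimal sketch of plain-text extraction with the high-level API, pdfminer.high_level.extract_text. The wrapper name extract_pdf_text and the file name in the usage comment are hypothetical; the wrapper only adds a file-existence check before handing the path to the library.

```python
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Return the text of a PDF file as a single string.

    `extract_pdf_text` is an illustrative wrapper, not part of pdfminer.six.
    """
    path = Path(pdf_path)
    if not path.is_file():
        # Fail early with a clear error instead of a parser traceback.
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported here so the check above works even before the package is installed.
    from pdfminer.high_level import extract_text

    return extract_text(str(path))


# Example usage (assumes a file named "sample.pdf" exists):
#     text = extract_pdf_text("sample.pdf")
#     print(text[:500])
```

For PDFs with multi-column layouts, the extracted reading order may not match the visual order, as several of the issues listed on this page describe.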

GitHub Issues

The pdfminer.six package has 245 open issues on GitHub, including:

  • Add extras_require in setup.py for PIL, and raise error if not installed when needing PIL
  • encodingdb.name2unicode(name: str) -> str can't handle type1 font diff like: 2, /'MT110', /'MT50',…
  • reading order is not quite right for multiple columns in one page
  • Same sentence is printed three times for a specific PDF file when using pdf2txt
  • extract images including their textual Figure number/title etc located below the image in a pdf. ie a margin around the image to be captured as well.
  • extract heading and section headers from pdf…cant acheive this now
  • Prefer logging to warning
  • Fix regression in page layout that sometimes returned text lines out of order
  • Text out of order with pdfminer 20201018
  • getting lots of (cid:#) instead of readable text
  • Question: Negative bbox coordinate (x1)
  • split a multi-page pdf file into multiple pdf files
  • list index out of range at self.cmap.add_cid2unichr(s1+i, code[i])

Related Packages & Articles

syslogmp 0.4

A parser for BSD syslog protocol (RFC 3164) messages

pdfrw 0.4

pdfrw is a Python library and utility that reads and writes PDF files.

pdfplumber 0.11.4

Plumb a PDF for detailed information about each char, rectangle, and line.