pdfminer.six 20220524

PDF parser and analyzer

Stars: 3863, Watchers: 3863, Forks: 776, Open Issues: 114

The pdfminer/pdfminer.six repository was created 8 years ago and was last updated 6 hours ago.
The project is popular, with 3863 GitHub stars.

How to Install pdfminer.six

You can install pdfminer.six using pip:

pip install pdfminer.six

Or add it to a project with Poetry:

poetry add pdfminer.six
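
After installing, it can be useful to confirm which version ended up in the active environment. The sketch below uses only the standard library's importlib.metadata module; the helper name installed_version is an illustrative assumption and not part of pdfminer.six.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "pdfminer.six") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("pdfminer.six is not installed in this environment")
    else:
        print(f"pdfminer.six {version} is installed")
```

Returning None instead of raising keeps the check usable in setup scripts that want to decide whether to run the install step.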

Package Details

Author: Yusuke Shinyama + Philippe Guglielmetti
License: MIT/X
Homepage: https://github.com/pdfminer/pdfminer.six
PyPI: https://pypi.org/project/pdfminer.six/
GitHub Repo: https://github.com/pdfminer/pdfminer.six

Classifiers

  • Text Processing

Code Examples

Here are some pdfminer.six code examples and snippets.
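
A minimal sketch of plain-text extraction, assuming pdfminer.six is installed. It relies on the extract_text function from pdfminer.high_level; the wrapper name pdf_to_text and the file name example.pdf are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from typing import Optional


def pdf_to_text(pdf_path: str) -> Optional[str]:
    """Return the text of a PDF file, or None if the file does not exist."""
    path = Path(pdf_path)
    if not path.is_file():
        return None
    # Imported lazily so this module still loads where pdfminer.six is missing.
    from pdfminer.high_level import extract_text

    return extract_text(str(path))


if __name__ == "__main__":
    text = pdf_to_text("example.pdf")  # hypothetical input file
    if text is None:
        print("example.pdf was not found")
    else:
        print(text[:500])  # show only the first 500 characters
```

Checking for the file first keeps a missing path from surfacing as an exception inside the library. Layout-sensitive documents, such as multi-column pages, may need tuned layout parameters, which this sketch leaves at their defaults.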

GitHub Issues

The pdfminer.six package has 114 open issues on GitHub:

  • Add extras_require in setup.py for PIL, and raise error if not installed when needing PIL
  • encodingdb.name2unicode(name: str) -> str can't handle type1 font diff like: 2, /'MT110', /'MT50',…
  • reading order is not quite right for multiple columns in one page
  • Same sentence is printed three times for a specific PDF file when using pdf2txt
  • extract images including their textual Figure number/title etc located below the image in a pdf. ie a margin around the image to be captured as well.
  • extract heading and section headers from pdf…can't achieve this now
  • Prefer logging to warning
  • Fix regression in page layout that sometimes returned text lines out of order
  • Text out of order with pdfminer 20201018
  • getting lots of (cid:#) instead of readable text
  • Question: Negative bbox coordinate (x1)
  • split a multi-page pdf file into multiple pdf files
  • list index out of range at self.cmap.add_cid2unichr(s1+i, code[i])

Related Packages & Articles

syslogmp 0.4

A parser for BSD syslog protocol (RFC 3164) messages

pdfrw 0.4

pdfrw is a Python library and utility that reads and writes PDF files.

pdfplumber 0.7.4

Plumb a PDF for detailed information about each char, rectangle, and line.