pdfminer.six 20221105

PDF parser and analyzer

Stars: 4045, Watchers: 4045, Forks: 790, Open Issues: 128

The pdfminer/pdfminer.six repository was created 8 years ago and was last updated 13 hours ago.
The project is popular, with 4045 GitHub stars.

How to Install pdfminer.six

You can install pdfminer.six using pip:

pip install pdfminer.six

or add it to a project with Poetry:

poetry add pdfminer.six
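
To confirm that the install succeeded, the installed version can be read from the package metadata. The following is a minimal sketch using only the Python standard library (Python 3.8 or later); the helper name `installed_version` is hypothetical and not part of pdfminer.six.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="pdfminer.six"):
    """Return the installed version string of a distribution, or None.

    `installed_version` is an illustrative helper, not a pdfminer.six API.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None
```

Calling `installed_version()` after a successful `pip install pdfminer.six` should return a version string; `None` means the package is not visible to the current interpreter.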

Package Details

Author: Yusuke Shinyama + Philippe Guglielmetti
License: MIT/X
Homepage: https://github.com/pdfminer/pdfminer.six
PyPI: https://pypi.org/project/pdfminer.six/
GitHub Repo: https://github.com/pdfminer/pdfminer.six

Classifiers

  • Text Processing

Code Examples

Here are some pdfminer.six code examples and snippets.
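
A common starting point is extracting all text from a PDF with the high-level API. The sketch below assumes pdfminer.six is installed and relies on `extract_text` from `pdfminer.high_level`; the wrapper name `pdf_to_text` and the file name in the usage note are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return the text of a PDF file as a single string.

    `pdf_to_text` is a hypothetical helper name; only `extract_text`
    comes from pdfminer.six.
    """
    path = Path(pdf_path)
    # Fail early with a clear error before touching the parser.
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported lazily so this module loads even where pdfminer.six is absent.
    from pdfminer.high_level import extract_text

    return extract_text(str(path))
```

For example, `print(pdf_to_text("example.pdf"))` would print the extracted text of a file named `example.pdf` (a placeholder name). Layout-sensitive documents, such as multi-column pages, may need further tuning that this sketch does not cover.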

GitHub Issues

The pdfminer.six package has 128 open issues on GitHub:

  • Add extras_require in setup.py for PIL, and raise error if not installed when needing PIL
  • encodingdb.name2unicode(name: str) -> str can't handle type1 font diff like: 2, /'MT110', /'MT50',…
  • reading order is not quite right for multiple columns in one page
  • Same sentence is printed three times for a specific PDF file when using pdf2txt
  • extract images including their textual Figure number/title etc located below the image in a pdf. ie a margin around the image to be captured as well.
  • extract heading and section headers from pdf…can't achieve this now
  • Prefer logging to warning
  • Fix regression in page layout that sometimes returned text lines out of order
  • Text out of order with pdfminer 20201018
  • getting lots of (cid:#) instead of readable text
  • Question: Negative bbox coordinate (x1)
  • split a multi-page pdf file into multiple pdf files
  • list index out of range at self.cmap.add_cid2unichr(s1+i, code[i])

Related Packages & Articles

syslogmp 0.4

A parser for BSD syslog protocol (RFC 3164) messages

pdfrw 0.4

pdfrw is a Python library and utility that reads and writes PDF files.

pdfplumber 0.7.6

Plumb a PDF for detailed information about each char, rectangle, and line.