pdfminer.six 20240706

PDF parser and analyzer

Stars: 5877, Watchers: 5877, Forks: 928, Open Issues: 245

The pdfminer/pdfminer.six repo was created 10 years ago, and the last code push was 2 months ago.
The project is popular, with 5877 GitHub stars.

How to Install pdfminer.six

You can install pdfminer.six using pip:

pip install pdfminer.six

or add it to a project with Poetry:

poetry add pdfminer.six

Package Details

Author: Yusuke Shinyama + Philippe Guglielmetti
License: MIT
Homepage: https://github.com/pdfminer/pdfminer.six
PyPI: https://pypi.org/project/pdfminer.six/
GitHub Repo: https://github.com/pdfminer/pdfminer.six

Classifiers

  • Text Processing

Code Examples

Here are some pdfminer.six code examples and snippets.
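
As a starting point, here is a minimal sketch that extracts text from a PDF and counts the `(cid:N)` placeholders that pdfminer.six emits when it cannot map a glyph to a Unicode character. It assumes the `extract_text` function from `pdfminer.high_level`; the helper names `count_cid_tokens` and `extract_text_with_cid_count` are hypothetical and chosen only for illustration.

```python
import re
from typing import Tuple

# Matches placeholders such as "(cid:123)", which show up in extracted text
# when a glyph could not be mapped to a Unicode character.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def count_cid_tokens(text: str) -> int:
    """Return how many "(cid:N)" placeholders appear in the text."""
    return len(CID_TOKEN.findall(text))


def extract_text_with_cid_count(pdf_path: str) -> Tuple[str, int]:
    """Extract text from a PDF and report the number of unmapped glyphs.

    Hypothetical helper: pdfminer.six is imported here so that
    count_cid_tokens can also be used on text extracted elsewhere.
    """
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    return text, count_cid_tokens(text)
```

Calling `extract_text_with_cid_count` with the path to a PDF returns the extracted text together with the placeholder count. A high count relative to the text length suggests the fonts in that file lack a usable Unicode mapping, in which case OCR may be a more practical route than further tuning.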

GitHub Issues

The pdfminer.six package has 245 open issues on GitHub:

  • Add extras_require in setup.py for PIL, and raise error if not installed when needing PIL
  • encodingdb.name2unicode(name: str) -> str can't handle type1 font diff like: 2, /'MT110', /'MT50',…
  • reading order is not quite right for multiple columns in one page
  • Same sentence is printed three times for a specific PDF file when using pdf2txt
  • extract images including their textual Figure number/title etc. located below the image in a pdf, i.e. a margin around the image to be captured as well.
  • extract heading and section headers from pdf…can't achieve this now
  • Prefer logging to warning
  • Fix regression in page layout that sometimes returned text lines out of order
  • Text out of order with pdfminer 20201018
  • getting lots of (cid:#) instead of readable text
  • Question: Negative bbox coordinate (x1)
  • split a multi-page pdf file into multiple pdf files
  • list index out of range at self.cmap.add_cid2unichr(s1+i, code[i])

See more issues on GitHub

Related Packages & Articles

pdfkit 1.0.0

Wkhtmltopdf python wrapper to convert html to pdf using the webkit rendering engine and qt

pdf2image 1.17.0

A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.

oletools 0.60.2

Python tools to analyze security characteristics of MS Office and OLE files (also called Structured Storage, Compound File Binary Format or Compound Document File Format), for Malware Analysis and Incident Response #DFIR

nbconvert 7.16.4

Converting Jupyter Notebooks (.ipynb files) to other formats. Output formats include asciidoc, html, latex, markdown, pdf, py, rst, script. nbconvert can be used both as a Python library (import nbconvert) or as a command line tool (invoked as jupyter nbconvert ...).

lkml 1.3.5

A speedy LookML parser implemented in pure Python.