trafilatura 1.12.2
Python package and command-line tool designed to gather text on the Web. It includes all the discovery and text processing components needed to perform web crawling, downloads, scraping, and extraction of main texts, metadata and comments.
Stars: 3528, Watchers: 3528, Forks: 255, Open Issues: 67
The adbar/trafilatura repo was created 5 years ago and the last code push was 2 days ago.
The project is popular, with 3528 GitHub stars.
How to Install trafilatura
You can install trafilatura using pip:
pip install trafilatura
or add it to a project with Poetry:
poetry add trafilatura
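To confirm that the install succeeded, the Python standard library can report the installed version. This is a minimal sketch; the helper name `installed_version` is hypothetical and not part of trafilatura.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the installed version string, or None if the package is missing.
    print(installed_version("trafilatura"))
```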
Package Details
- Author: Adrien Barbaresi
- License: Apache-2.0
- Homepage: https://trafilatura.readthedocs.io
- PyPI: https://pypi.org/project/trafilatura/
- Documentation: https://trafilatura.readthedocs.io
- GitHub Repo: https://github.com/adbar/trafilatura
Classifiers
- Internet/WWW/HTTP
- Scientific/Engineering/Information Analysis
- Security
- Text Editors/Text Processing
- Text Processing/Linguistic
- Text Processing/Markup/HTML
- Text Processing/Markup/Markdown
- Text Processing/Markup/XML
- Utilities
Errors
A list of common trafilatura errors.
Code Examples
Here are some trafilatura code examples and snippets.
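As one such snippet, the sketch below extracts the main text from an HTML string with `trafilatura.extract`. The sample HTML and the wrapper name `extract_main_text` are illustrative assumptions, not part of the package. The wrapper returns None when trafilatura is not installed or when no text could be extracted.

```python
from typing import Optional

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html><head><title>Sample page</title></head>
<body>
  <nav>Home | About | Contact</nav>
  <article>
    <h1>Sample page</h1>
    <p>This paragraph stands in for the main content of a web page.
    A text extractor is expected to keep passages like this one and to
    discard navigation bars, footers and similar page furniture.</p>
  </article>
  <footer>Copyright notice</footer>
</body></html>
"""


def extract_main_text(html: str) -> Optional[str]:
    """Return the main text of an HTML document, or None if unavailable."""
    try:
        import trafilatura  # imported lazily so the sketch loads without it
    except ImportError:
        return None
    # extract() returns the extracted text as a string, or None on failure.
    return trafilatura.extract(html)


if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

For a live page, `trafilatura.fetch_url(url)` downloads the HTML (returning None on failure), and the result can be passed to `extract` in the same way.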
GitHub Issues
The trafilatura package has 67 open issues on GitHub, including:
- maintenance: simplify code
- Web API idea
- Corrupted Markdown output when TXT+formatting
- Question about the title
- improve code support
- Empty h1 blocks non-empty h2
- author metadata field is null for YouTube videos
- included_images failed when trying to extract images in a table
- Redirecting https://twitter.com
- Is it possible to get the metadata with markdown format?
- Code tags are not parsed properly
- Image markdown not included during processing
- Code example for Multi-Threaded downloads seems out of date
- feat: use proxy to extract data
- xml extraction leads to <graphic> tags in the wrong place.