Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments.
The adbar/trafilatura repo was created 4 years ago and the last code push was 2 days ago.
The project is very popular, with an impressive 1832 GitHub stars!
How to Install trafilatura
You can install trafilatura using pip:
pip install trafilatura
or add it to a project with poetry:
poetry add trafilatura
- Adrien Barbaresi
- GitHub Repo: adbar/trafilatura
- Scientific/Engineering/Information Analysis
- Text Editors/Text Processing
- Text Processing/Linguistic
- Text Processing/Markup/HTML
- Text Processing/Markup/Markdown
- Text Processing/Markup/XML
A list of common trafilatura errors.
Here are some trafilatura code examples and snippets.
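As a minimal sketch of the main-text extraction described above, the snippet below runs trafilatura's `extract` function on an inline HTML string, so no network access is needed. The sample page and the wrapper name `extract_main_text` are assumptions made for illustration and are not part of the package. The guarded import is also an assumption, so that the sketch still runs (and returns `None`) where trafilatura is not installed.

```python
from typing import Optional

try:
    # extract() is trafilatura's main-text extraction function.
    from trafilatura import extract
except ImportError:
    # Assumption for this sketch: degrade gracefully if the package is missing.
    extract = None

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Sample article</title></head>
  <body>
    <nav><a href="/">Home</a> <a href="/about">About</a></nav>
    <article>
      <h1>Gathering text on the Web</h1>
      <p>Main-text extraction keeps the body of a page and drops
      navigation, footers and other boilerplate around it.</p>
      <p>This second paragraph gives the extractor a little more
      content to work with than a single short sentence would.</p>
    </article>
    <footer>Copyright notice and site links</footer>
  </body>
</html>
"""


def extract_main_text(html: str) -> Optional[str]:
    """Return the main text of an HTML document, or None.

    None is returned when trafilatura is unavailable or when the
    extractor finds no usable main text in the input.
    """
    if extract is None:
        return None
    # include_comments=False skips comment sections of the page.
    return extract(html, include_comments=False)


if __name__ == "__main__":
    text = extract_main_text(SAMPLE_HTML)
    print(text if text is not None else "No main text extracted.")
```

For live pages, the HTML string would typically come from a download step first; the extraction call itself stays the same. Very short documents may yield no result, so callers should be prepared for `None`.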
The trafilatura package has 52 open issues on GitHub:
- maintenance: simplify code
- Web API idea
- Corrupted Markdown output when TXT+formatting
- Question about the title
- improve code support
- Empty h1 blocks non-empty h2
- author metadata field is null for YouTube videos
- included_images failed when trying to extract images in a table
- Redirecting https://twitter.com
- Is it possible to get the metadata with markdown format?
- Code tags are not parsed properly
- Image markdown not included during processing
- Code example for Multi-Threaded downloads seems out of date
- feat: use proxy to extract data
- xml extraction leads to <graphic> tags in the wrong place.