
trafilatura 1.12.2


08-20-2023


Python package and command-line tool designed to gather text on the Web, includes all necessary discovery and text processing components to perform web crawling, downloads, scraping, and extraction of main texts, metadata and comments.


Stars: 3528, Watchers: 3528, Forks: 255, Open Issues: 67

The adbar/trafilatura repo was created 5 years ago and the last code push was 2 days ago.
The project has 3528 GitHub stars.

How to Install trafilatura

You can install trafilatura using pip:

pip install trafilatura

Or add it to a project with Poetry:

poetry add trafilatura
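
After installing, it can be useful to confirm which version ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper names `installed_version` and `report` are hypothetical and not part of trafilatura.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(package: str = "trafilatura") -> str:
    """Build a one-line, human-readable installation status for a package."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed"
    return f"{package} {found} is installed"
```

Calling `print(report())` in the environment where pip or Poetry installed the package should print the installed trafilatura version.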

Package Details

Author: Adrien Barbaresi
License: Apache-2.0
Homepage: https://trafilatura.readthedocs.io
PyPI: https://pypi.org/project/trafilatura/
Documentation: https://trafilatura.readthedocs.io
GitHub Repo: https://github.com/adbar/trafilatura

Classifiers

Internet/WWW/HTTP
Scientific/Engineering/Information Analysis
Security
Text Editors/Text Processing
Text Processing/Linguistic
Text Processing/Markup/HTML
Text Processing/Markup/Markdown
Text Processing/Markup/XML
Utilities


Code Examples

Here are some trafilatura code examples and snippets.
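
As a minimal sketch of the download-and-extract workflow described above, the snippet below wraps `trafilatura.fetch_url` and `trafilatura.extract`, the two entry points the library documents for this purpose. The helper names `extract_main_text` and `fetch_and_extract` and the `extractor` parameter are hypothetical, added here so the logic can be tried with a stand-in function before running it against real pages.

```python
from typing import Callable, Optional


def extract_main_text(
    html: str,
    extractor: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return the main text of an HTML document, or None if nothing was found."""
    if not html or not html.strip():
        return None
    if extractor is None:
        # Imported lazily so the helper can be exercised with a stand-in extractor.
        import trafilatura

        extractor = trafilatura.extract
    text = extractor(html)
    return text.strip() if text else None


def fetch_and_extract(url: str) -> Optional[str]:
    """Download a page and extract its main text; None if either step fails."""
    import trafilatura

    downloaded = trafilatura.fetch_url(url)  # returns None when the download fails
    if downloaded is None:
        return None
    return extract_main_text(downloaded, extractor=trafilatura.extract)
```

With trafilatura installed, `fetch_and_extract("https://example.org")` (a placeholder URL) returns the extracted text as a string, or `None` when the page cannot be downloaded or no main text is found. Both cases should be handled by the caller.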

GitHub Issues

The trafilatura package has 67 open issues on GitHub, including:

maintenance: simplify code
Web API idea
Corrupted Markdown output when TXT+formatting
Question about the title
improve code support
Empty h1 blocks non-empty h2
author metadata field is null for YouTube videos
included_images failed when trying to extract images in a table
Redirecting https://twitter.com
Is it possible to get the metadata with markdown format?
Code tags are not parsed properly
Image markdown not included during processing
Code example for Multi-Threaded downloads seems out of date
feat: use proxy to extract data
xml extraction leads to <graphic> tags in the wrong place.

See more issues on GitHub

Related Packages & Articles


news-please 1.6.13

news-please is an open source easy-to-use news extractor that just works.

pydude 0.28.0

Dude (Uncomplicated Data Extraction) is a Python framework designed for crafting web scrapers with ease. This Flask-inspired framework facilitates the rapid creation of web scrapers with its intuitive syntax. While it's still in pre-alpha, Dude doesn't skimp on features. It supports multiple parser backends like BeautifulSoup4 and LXML, and offers functionalities like URL pattern matching and data grouping.