trafilatura 1.6.1

Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments.

Stars: 1832, Watchers: 1832, Forks: 148, Open Issues: 52

The adbar/trafilatura repo was created 4 years ago and the last code push was 2 days ago.
The project is very popular, with 1832 GitHub stars.

How to Install trafilatura

You can install trafilatura using pip:

pip install trafilatura

Or add it to a project with Poetry:

poetry add trafilatura
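
To confirm that the installation worked, a quick check along these lines can help. This is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); the helper name installed_version is made up for illustration and is not part of trafilatura.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        # The package is not installed in the current environment.
        return None


if __name__ == "__main__":
    found = installed_version("trafilatura")
    print(found if found else "trafilatura is not installed")
```

Run it with the same interpreter or virtual environment that pip or Poetry installed into, otherwise the package may be reported as missing.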

Package Details

Author: Adrien Barbaresi
License: GPLv3+
Homepage: https://trafilatura.readthedocs.io
PyPI: https://pypi.org/project/trafilatura/
Documentation: https://trafilatura.readthedocs.io
GitHub Repo: https://github.com/adbar/trafilatura

Classifiers

  • Internet/WWW/HTTP
  • Scientific/Engineering/Information Analysis
  • Security
  • Text Editors/Text Processing
  • Text Processing/Linguistic
  • Text Processing/Markup/HTML
  • Text Processing/Markup/Markdown
  • Text Processing/Markup/XML
  • Utilities

Errors

A list of common trafilatura errors.

Code Examples

Here are some trafilatura code examples and snippets.
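
As a starting point, the sketch below extracts the main text from an HTML string held in memory. It assumes trafilatura is installed and relies only on trafilatura.extract(); the sample page and the helper name extract_main_text are hypothetical and exist only for illustration. The import sits inside the helper so the script still runs (returning None) when the library is missing.

```python
# Hypothetical sample page, kept inline so the example needs no network access.
SAMPLE_HTML = """<html>
<head><title>Sample article</title></head>
<body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article>
<h1>Sample article</h1>
<p>Web pages usually wrap their main content in navigation menus, sidebars
and footers. A text extraction tool tries to keep the article body and
discard the surrounding boilerplate.</p>
<p>This second paragraph only adds enough running text for the extractor
to treat the article element as the main content of the page.</p>
</article>
<footer>Copyright notice and other boilerplate.</footer>
</body></html>"""


def extract_main_text(html):
    """Return the main text found in html, or None if nothing is extracted."""
    try:
        import trafilatura  # third-party; install with: pip install trafilatura
    except ImportError:
        return None  # library missing, so no extraction is attempted
    # extract() accepts an HTML string and returns a str, or None on failure.
    return trafilatura.extract(html, include_comments=False)


if __name__ == "__main__":
    text = extract_main_text(SAMPLE_HTML)
    print(text if text is not None else "no text extracted")
```

For a live page, trafilatura.fetch_url(url) downloads the HTML, which can then be passed to extract() in the same way. Very short or sparse documents may yield None, so callers should handle that case.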

GitHub Issues

The trafilatura package has 52 open issues on GitHub, including:

  • maintenance: simplify code
  • Web API idea
  • Corrupted Markdown output when TXT+formatting
  • Question about the title
  • improve code support
  • Empty h1 blocks non-empty h2
  • author metadata field is null for YouTube videos
  • included_images failed when trying to extract images in a table
  • Redirecting https://twitter.com
  • Is it possible to get the metadata with markdown format?
  • Code tags are not parsed properly
  • Image markdown not included during processing
  • Code example for Multi-Threaded downloads seems out of date
  • feat: use proxy to extract data
  • xml extraction leads to <graphic> tags in the wrong place.


Related Packages & Articles

news-please 1.5.33

news-please is an open source easy-to-use news extractor that just works.

pydude 0.27.0

Dude (Uncomplicated Data Extraction) is a Python framework designed for crafting web scrapers with ease. This Flask-inspired framework facilitates the rapid creation of web scrapers with its intuitive syntax. While it's still in pre-alpha, Dude doesn't skimp on features. It supports multiple parser backends like BeautifulSoup4 and LXML, and offers functionalities like URL pattern matching and data grouping.

Scrapy 2.9.0

A high-level Web Crawling and Web Scraping framework

gnews 0.3.1

Provides an API to search for articles on Google News and returns a usable JSON response.

google-search-results 2.4.2

Scrape and search localized results from Google, Bing, Baidu, Yahoo, Yandex, eBay, Home Depot, and YouTube at scale using SerpApi.com.

sumy 0.11.0

Simple library and command-line utility for extracting summaries from HTML pages or plain text. The package also contains a simple evaluation framework for text summaries. The implemented summarization methods are described in the documentation.