dedupe 2.0.23

A Python library for accurate and scalable data deduplication and entity resolution

Stars: 3967, Watchers: 3967, Forks: 538, Open Issues: 78

The dedupeio/dedupe repo was created 11 years ago and the last code push was 3 weeks ago.
The project is widely used, with 3967 GitHub stars.

How to Install dedupe

You can install dedupe using pip

pip install dedupe

or add it to a project with poetry

poetry add dedupe
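After installing, it can be useful to confirm which version ended up in the active environment. The following is a minimal sketch using only the Python standard library; the helper name installed_version is made up for illustration and is not part of dedupe.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent.

    Hypothetical helper for illustration; it only reads package metadata
    and never imports the package itself.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("dedupe")
    if found is None:
        print("dedupe is not installed in this environment")
    else:
        print(f"dedupe {found} is installed")
```

Reading the metadata rather than importing the package keeps the check cheap and avoids triggering any import-time side effects.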

Package Details

Author
License
The MIT License (MIT)

Copyright (c) 2014 Forest Gregg, Derek Eder, DataMade and Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Homepage
PyPI:
https://pypi.org/project/dedupe/
Documentation:
https://docs.dedupe.io/en/latest/
GitHub Repo:
https://github.com/dedupeio/dedupe

Classifiers

  • Scientific/Engineering
  • Scientific/Engineering/Information Analysis
  • Software Development/Libraries/Python Modules

Code Examples

Here are some dedupe code examples and snippets.
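As a starting point, here is a rough sketch of a deduplication run. It assumes the dict-style field definitions and the Dedupe, prepare_training, console_label, train and partition calls described in the 2.x documentation; other major versions may define fields differently. The field names "name" and "address" and both helper functions are hypothetical, chosen only for illustration.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# Assumed input shape: {record_id: {field_name: value}}.
Records = Dict[Hashable, Dict[str, object]]
# Assumed output shape of partition(): [(record_ids, confidence_scores), ...].
Clusters = List[Tuple[Sequence[Hashable], Sequence[float]]]


def find_duplicate_clusters(records: Records, threshold: float = 0.5) -> Clusters:
    """Train interactively on `records` and group likely duplicates."""
    import dedupe  # imported lazily so this module loads without the package

    # Hypothetical field names; adjust to the columns in your own data.
    fields = [
        {"field": "name", "type": "String"},
        {"field": "address", "type": "String", "has missing": True},
    ]
    deduper = dedupe.Dedupe(fields)
    deduper.prepare_training(records)  # sample candidate pairs for labeling
    dedupe.console_label(deduper)      # interactive: prompts in the terminal
    deduper.train()                    # learn blocking rules and a classifier
    return deduper.partition(records, threshold)


def cluster_lookup(clusters: Clusters) -> Dict[Hashable, Tuple[int, float]]:
    """Map each record id to (cluster index, confidence score)."""
    lookup: Dict[Hashable, Tuple[int, float]] = {}
    for cluster_id, (record_ids, scores) in enumerate(clusters):
        for record_id, score in zip(record_ids, scores):
            lookup[record_id] = (cluster_id, float(score))
    return lookup
```

Because the labeling step is interactive, find_duplicate_clusters is meant to be run from a terminal; cluster_lookup only reshapes the result and can be used without dedupe installed.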

GitHub Issues

The dedupe package has 78 open issues on GitHub

  • Add Docs build to CI
  • Performance degrades when loading/training with large labeled training file to prepare_train()
  • partially supervised classification
  • deprecate recall argument for precision and expose argument for tree depth
  • use sqlite's fts5 for tf/idf index predicates
  • blocking for partitioning
  • ergonomics for working with differently named fields in linkage and gazetteer mode
  • virtual compound predicate
  • Bring back multiple matches now that we have SQL based blocking
  • Document how to create a new variable type
  • Improve the sampling
  • Parallelize blocking (Fingerprinter)
  • A different approach to "community detection"?
  • sorted neighborhoods
  • index predicates always used by labeler
