dedupe 3.0.3
A Python library for accurate and scalable data deduplication and entity resolution
Stars: 4118, Watchers: 4118, Forks: 551, Open Issues: 75
The dedupeio/dedupe repo was created 12 years ago and the last code push was 5 days ago.
The project is popular, with 4118 GitHub stars.
How to Install dedupe
You can install dedupe using pip:
pip install dedupe
or add it to a project with Poetry:
poetry add dedupe
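To confirm that the install worked, one option is to ask Python which version of the distribution it can see. The sketch below uses only the standard library; the helper name `installed_version` is made up for illustration and is not part of dedupe.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("dedupe")
    if found is None:
        print("dedupe is not installed in this environment")
    else:
        print(f"dedupe {found} is installed")
```

Run it with the same interpreter (or virtual environment) that pip or Poetry installed into; otherwise the package may be reported as missing.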
Package Details
- Author
- None
- License
- The MIT License (MIT) Copyright (c) 2014 Forest Gregg, Derek Eder, DataMade and Contributors Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Homepage
- None
- PyPI:
- https://pypi.org/project/dedupe/
- Documentation:
- https://docs.dedupe.io/en/latest/
- GitHub Repo:
- https://github.com/dedupeio/dedupe
Classifiers
- Scientific/Engineering
- Scientific/Engineering/Information Analysis
- Software Development/Libraries/Python Modules
Code Examples
Here are some dedupe code examples and snippets.
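A minimal sketch of a deduplication run, assuming the dedupe 3.x API (`dedupe.variables`, `Dedupe`, `prepare_training`, `console_label`, `train`, `partition`). The field names `name` and `address`, and the helpers `to_dedupe_records` and `cluster_duplicates`, are assumptions made for illustration; check the documentation linked above for the calls your version supports. The first helper shapes rows into a dictionary keyed by record ID with blank strings turned into `None`, which is how this sketch marks missing values.

```python
from typing import Any, Iterable


def to_dedupe_records(rows: Iterable[dict[str, Any]]) -> dict[int, dict[str, Any]]:
    """Key rows by position and convert blank strings to None (hypothetical helper)."""
    records: dict[int, dict[str, Any]] = {}
    for record_id, row in enumerate(rows):
        cleaned = {}
        for field, value in row.items():
            if isinstance(value, str):
                value = value.strip() or None  # "" or whitespace becomes None
            cleaned[field] = value
        records[record_id] = cleaned
    return records


def cluster_duplicates(records: dict[int, dict[str, Any]], threshold: float = 0.5):
    """Train interactively, then group records believed to be the same entity."""
    # Imported here so the helper above can be used without dedupe installed.
    import dedupe

    # Assumed field names; replace them with the columns in your own data.
    fields = [
        dedupe.variables.String("name"),
        dedupe.variables.String("address", has_missing=True),
    ]
    deduper = dedupe.Dedupe(fields)
    deduper.prepare_training(records)  # sample candidate pairs for labeling
    dedupe.console_label(deduper)      # interactive: label pairs in the terminal
    deduper.train()
    # Each result pairs a tuple of record IDs with per-record confidence scores.
    return deduper.partition(records, threshold=threshold)
```

Training is interactive and needs a realistically sized dataset, so the two-row input in a call like `to_dedupe_records([{"name": "Ann Lee", "address": ""}])` is only useful for checking the record format, not for training.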
GitHub Issues
The dedupe package has 75 open issues on GitHub, including:
- Add Docs build to CI
- Performance degrades when loading/training with large labeled training file to prepare_train()
- partially supervised classification
- deprecate recall argument for precision and expose argument for tree depth
- use sqlite's fts5 for tf/idf index predicates
- blocking for partitioning
- ergonomics for working with differently named fields in linkage and gazetteer mode
- virtual compound predicate
- Bring back multiple matches now that we have SQL based blocking
- Document how to create a new variable type
- Improve the sampling
- Parallelize blocking (Fingerprinter)
- A different approach to "community detection"?
- sorted neighborhoods
- index predicates always used by labeler