Mapping the Harvard Classics to Project Gutenberg
I wanted good digital copies of the Harvard Classics, the fifty-volume anthology edited by Charles W. Eliot and first published in 1909.
Most of the works are on Project Gutenberg, but finding exact matches is tedious: titles vary, author names vary, and many works exist in multiple editions. I had a list of 300 works in a CSV, and searching for each by hand wasn’t practical.
So I wrote a script.
Approach
I used Gutendex, a simple JSON API over the Project Gutenberg catalog. Its search endpoint matches words in titles and author names and returns structured metadata for each book.
Basic search:
from urllib.parse import quote_plus

def search_gutenberg(self, title, author=""):
    # Combine title and author into a single search query.
    search_query = f"{title} {author}".strip()
    encoded_query = quote_plus(search_query)
    url = f"https://gutendex.com/books?search={encoded_query}"
    response = self.session.get(url)
    if response.status_code == 200:
        return response.json()
    return None
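Gutendex wraps its hits in a results envelope, so the caller unpacks data['results'] before scoring. A minimal usage sketch (GutenbergMapper is a hypothetical wrapper class holding the requests.Session; the method above would live on it):

mapper = GutenbergMapper()
data = mapper.search_gutenberg("Crime and Punishment", "Fyodor Dostoyevsky")
if data:
    print(data['count'])                # total number of hits
    for book in data['results'][:3]:    # each hit carries id, title, authors, formats, ...
        print(book['id'], book['title'])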
To select the best match, I combined title and author similarity:
from difflib import SequenceMatcher

def find_best_match(self, target_title, target_author, results):
    best_score = 0
    best_match = None
    for book in results:
        # Title similarity carries 70% of the weight.
        title_score = SequenceMatcher(
            None,
            target_title.lower(),
            book['title'].lower()
        ).ratio() * 0.7
        # Author similarity carries the remaining 30%.
        author_score = 0
        if target_author and book.get('authors'):
            author_name = book['authors'][0]['name'].lower()
            author_score = SequenceMatcher(
                None,
                target_author.lower(),
                author_name
            ).ratio() * 0.3
        total_score = title_score + author_score
        # Require a minimum combined score to reject spurious hits.
        if total_score > best_score and total_score > 0.3:
            best_score = total_score
            best_match = book
    return best_match
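The two methods chain together like this (again a sketch using the hypothetical mapper instance from above). Note that find_best_match expects the results list, not the raw response:

data = mapper.search_gutenberg(title, author)
match = None
if data and data.get('results'):
    match = mapper.find_best_match(title, author, data['results'])

The 0.7/0.3 split weights the title more heavily than the author name, and the 0.3 floor discards candidates that match well on neither.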
For each match, the script also extracts basic metadata:
def get_book_metadata(self, book):
    return {
        'gutenberg_id': book.get('id'),
        'gutenberg_url': f"https://www.gutenberg.org/ebooks/{book.get('id')}",
        'matched_title': book.get('title', ''),
        'download_count': book.get('download_count', 0),
        'subjects': '; '.join(book.get('subjects', [])[:5]),  # keep the first five subjects
        'formats_available': len(book.get('formats', {}))
    }
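The formats field maps MIME types to download URLs, so pulling out, say, a plain-text link is one more small step (a sketch; get_text_url is not part of the original script):

def get_text_url(book):
    # Gutendex keys formats by MIME type, e.g. "text/plain; charset=utf-8".
    for mime, url in book.get('formats', {}).items():
        if mime.startswith('text/plain'):
            return url
    return None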
To handle long runs and interruptions, it saves progress every ten matches or on SIGINT:
import signal
import sys

# Registered once at start-up:
signal.signal(signal.SIGINT, self.signal_handler)

def signal_handler(self, signum, frame):
    print("\nSaving progress...")
    if self.df is not None:
        self.df.to_csv(self.output_file, index=False)
    sys.exit(0)
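The every-ten-matches save is a simple counter check in the main loop, roughly like this (a sketch of the idea, not the script’s exact loop):

matched = 0
for _, row in self.df.iterrows():
    # ... search Gutendex, score candidates, write metadata back into self.df ...
    matched += 1
    if matched % 10 == 0:
        self.df.to_csv(self.output_file, index=False)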
Results
Run:
python harvard_gutenberg_mapper.py input.csv output.csv
Sample output:
[1/300] Processing: 'The Autobiography of Benjamin Franklin' by 'Benjamin Franklin'
✓ Found: Autobiography of Benjamin Franklin (ID: 20203, Downloads: 15,234)
[2/300] Processing: 'Crime and Punishment' by 'Fyodor Dostoyevsky'
✓ Found: Crime and Punishment (ID: 2554, Downloads: 30,161)
The script matched 185 of the 300 works (about 62%). The gaps were mostly works without a Gutenberg edition or titles too ambiguous to match confidently.
Summary
I wanted a reproducible way to map the Harvard Classics to public domain digital editions. This script does that. It’s available here:
GitHub: harvard-classics-gutenberg-mapper
The repo includes the complete dataset with matches and metadata.