Mapping the Harvard Classics to Project Gutenberg

I wanted good digital copies of the Harvard Classics, the fifty-volume anthology edited by Charles W. Eliot and first published in 1909.

Most of the works are on Project Gutenberg, but finding exact matches is tedious: titles vary, author names vary, and multiple editions appear. I had a list of 300 works in a CSV, and searching for each one by hand wasn't practical.

So I wrote a script.

Approach

I used the Gutendex API, a simple JSON interface to Project Gutenberg. Its search endpoint matches words against titles and author names and returns clean, structured metadata.

Basic search:

from urllib.parse import quote_plus

def search_gutenberg(self, title, author=""):
    # Build a single free-text query from the title and (optional) author.
    search_query = f"{title} {author}".strip()
    encoded_query = quote_plus(search_query)
    url = f"https://gutendex.com/books?search={encoded_query}"

    response = self.session.get(url)
    if response.status_code == 200:
        return response.json()
    return None
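
Gutendex wraps its results in a paginated envelope, so the response has to be unwrapped before matching. A trimmed illustration of the shape, with field values borrowed from the sample run further down (the "count" value and the elisions are mine):

response = self.search_gutenberg("Crime and Punishment", "Fyodor Dostoyevsky")
# response looks roughly like:
# {
#     "count": 12,
#     "next": "https://gutendex.com/books/?page=2&search=...",
#     "previous": None,
#     "results": [
#         {
#             "id": 2554,
#             "title": "Crime and Punishment",
#             "authors": [{"name": "Dostoyevsky, Fyodor", ...}],
#             "download_count": 30161,
#             "subjects": [...],
#             "formats": {...},
#         },
#         ...
#     ],
# }
books = response["results"] if response else []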

To select the best match, I combined title and author similarity:

from difflib import SequenceMatcher

def find_best_match(self, target_title, target_author, results):
    best_score = 0
    best_match = None

    for book in results:
        # Title similarity carries 70% of the weight.
        title_score = SequenceMatcher(
            None,
            target_title.lower(),
            book['title'].lower()
        ).ratio() * 0.7

        # Author similarity carries the remaining 30%, when available.
        author_score = 0
        if target_author and book.get('authors'):
            author_name = book['authors'][0]['name'].lower()
            author_score = SequenceMatcher(
                None,
                target_author.lower(),
                author_name
            ).ratio() * 0.3

        total_score = title_score + author_score

        # Keep the best candidate that clears a minimum threshold of 0.3.
        if total_score > best_score and total_score > 0.3:
            best_score = total_score
            best_match = book

    return best_match
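
One quirk worth noting: Project Gutenberg catalogs authors as "Last, First" ("Dostoyevsky, Fyodor"), and Gutendex passes that through, so a raw SequenceMatcher comparison against a "First Last" spelling in the CSV loses some score. A small normalization that would help; whether the actual script does this isn't shown here:

def normalize_author(name):
    # Turn Gutenberg's "Last, First" catalog order into "first last".
    if ',' in name:
        last, _, first = name.partition(',')
        name = f"{first.strip()} {last.strip()}"
    return name.lower()

# normalize_author("Dostoyevsky, Fyodor") -> "fyodor dostoyevsky"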

For each match, the script also extracts basic metadata:

def get_book_metadata(self, book):
    return {
        'gutenberg_id': book.get('id'),
        'gutenberg_url': f"https://www.gutenberg.org/ebooks/{book.get('id')}",
        'matched_title': book.get('title', ''),
        'download_count': book.get('download_count', 0),
        'subjects': '; '.join(book.get('subjects', [])[:5]),
        'formats_available': len(book.get('formats', {}))
    }
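
The formats field maps MIME types to file URLs, which is what you'd use to actually download a copy. A sketch of picking a plain-text URL from it; the preference order here is my own assumption, not something the script is shown doing:

def pick_text_url(book):
    # Prefer UTF-8 plain text, then any other text/plain variant.
    formats = book.get('formats', {})
    if 'text/plain; charset=utf-8' in formats:
        return formats['text/plain; charset=utf-8']
    for mime, url in formats.items():
        if mime.startswith('text/plain'):
            return url
    return None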

To handle long runs and interruptions, it saves progress every ten matches or on SIGINT:

import signal
import sys

# Register the handler once at startup:
signal.signal(signal.SIGINT, self.signal_handler)

def signal_handler(self, signum, frame):
    # Flush whatever has been matched so far, then exit cleanly.
    print("\nSaving progress...")
    if self.df is not None:
        self.df.to_csv(self.output_file, index=False)
    sys.exit(0)
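
The every-ten-matches checkpoint isn't shown above. A minimal sketch of how it might sit in the main loop, where process_row is a hypothetical per-work search-and-match step:

for i, row in enumerate(self.df.itertuples()):
    self.process_row(row)
    if (i + 1) % 10 == 0:
        # Checkpoint so an interrupted run loses at most ten lookups.
        self.df.to_csv(self.output_file, index=False)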

Results

Run:

python harvard_gutenberg_mapper.py input.csv output.csv
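
The input CSV needs at least a title and an author per work. The column names below are hypothetical; the real header is in the repo's dataset:

title,author
The Autobiography of Benjamin Franklin,Benjamin Franklin
Crime and Punishment,Fyodor Dostoyevsky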

Sample output:

[1/300] Processing: 'The Autobiography of Benjamin Franklin' by 'Benjamin Franklin'
  ✓ Found: Autobiography of Benjamin Franklin (ID: 20203, Downloads: 15,234)

[2/300] Processing: 'Crime and Punishment' by 'Fyodor Dostoyevsky'
  ✓ Found: Crime and Punishment (ID: 2554, Downloads: 30,161)

The script matched 185 of the 300 works (~62%). The misses were mostly works with no edition on Gutenberg, plus titles too ambiguous to clear the matching threshold.

Summary

I wanted a reproducible way to map the Harvard Classics to public domain digital editions. This script does that. It’s available here:

GitHub: harvard-classics-gutenberg-mapper


The repo includes the complete dataset with matches and metadata.