enginoid/arxiv-qa-playground

Arxiv QA

A retrieval-augmented generation example that answers questions using arXiv titles and abstracts.

Demo video: arxiv-retrieval-anns (sped up 3x).

Setup

  • Copy secrets-example.json and replace the placeholder with your own key.
  • Fetch arxiv-metadata-oai-snapshot.json from Kaggle:
    • kaggle datasets download -d Cornell-University/arxiv
  • Run preprocess_dataset.py
    • Input file: arxiv-metadata-oai-snapshot.json
    • Output file: documents.json (somewhat smaller than the input)
  • Run docker compose up -d to start Meilisearch and Qdrant.
  • Then run the ingestion scripts:
    • ingest_to_meilisearch.py
    • ingest_to_qdrant.py
      • You'll want a GPU 😁. Use nvitop to check that the script is actually using it.
      • Example performance: g5.xlarge (1x A10G), ~600k abstracts, ~12 minutes
  • Finally, run query.py to ask some questions.
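
The preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the contents of preprocess_dataset.py: it assumes the Kaggle snapshot is read as one JSON object per line, and that only the id, title, and abstract fields are kept. The function name `preprocess` and the choice to write a single JSON array are assumptions made for this example. Any filtering the real script applies is not shown.

```python
import json
from pathlib import Path


def preprocess(src: Path, dst: Path) -> int:
    """Reduce snapshot records to id/title/abstract and return how many were written.

    Hypothetical helper: the field selection and output layout are assumptions.
    """
    documents = []
    with src.open(encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines between records
            record = json.loads(line)
            documents.append({
                "id": record["id"],
                # Collapse the hard line breaks and repeated spaces in the raw text.
                "title": " ".join(record["title"].split()),
                "abstract": " ".join(record["abstract"].split()),
            })
    with dst.open("w", encoding="utf-8") as out:
        json.dump(documents, out)
    return len(documents)
```

Calling `preprocess(Path("arxiv-metadata-oai-snapshot.json"), Path("documents.json"))` would produce the smaller file that the ingestion scripts read. Dropping the unused metadata fields is what makes the output smaller in this sketch.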

Other tips

  • A handy local server for testing Meilisearch keyword lookup is available at http://localhost:8080/
  • cli.py could be useful, but at the moment it only exposes meilisearch_index and meilisearch_client
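
Keyword lookup can also be tried from a script. The sketch below builds a request for Meilisearch's search endpoint (`POST /indexes/{index_uid}/search`) using only the standard library. The index name `arxiv`, the base URL, and the helper names are assumptions for illustration. Meilisearch listens on port 7700 by default, but the port mapping in this repository's Docker Compose file may differ, and the API key (if one is set) would come from your secrets file.

```python
import json
import urllib.request


def build_search_request(query, index_uid="arxiv", base_url="http://localhost:7700",
                         api_key=None, limit=5):
    """Build a keyword search request; index_uid and base_url are assumed values."""
    body = json.dumps({"q": query, "limit": limit}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/indexes/{index_uid}/search",
        data=body,
        headers=headers,
        method="POST",
    )


def search(query, **kwargs):
    """Send the request and return the matching documents (needs a running server)."""
    with urllib.request.urlopen(build_search_request(query, **kwargs)) as response:
        return json.loads(response.read())["hits"]
```

With the containers running and the index populated, `search("graph neural networks")` would return the top keyword matches as a list of dictionaries.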
