This repository was archived by the owner on Apr 21, 2025. It is now read-only.

How to get the Stack Overflow dump data and train the model from scratch? #10

Closed
SeekPoint opened this issue Sep 1, 2020 · 2 comments

@SeekPoint

No description provided.

@davidmezzetti
Member

Once codequestion is installed, the following steps will help build the model from scratch. I'll keep this issue open to also document this in the README:

1.) Download files from Stack Exchange: https://archive.org/details/stackexchange

2.) Place the selected files into a directory structure like the one shown below (the current process requires all of these files).

stackexchange/ai/ai.stackexchange.com.7z
stackexchange/android/android.stackexchange.com.7z
stackexchange/apple/apple.stackexchange.com.7z
stackexchange/arduino/arduino.stackexchange.com.7z
stackexchange/askubuntu/askubuntu.com.7z
stackexchange/avp/avp.stackexchange.com.7z
stackexchange/codereview/codereview.stackexchange.com.7z
stackexchange/cs/cs.stackexchange.com.7z
stackexchange/datascience/datascience.stackexchange.com.7z
stackexchange/dba/dba.stackexchange.com.7z
stackexchange/devops/devops.stackexchange.com.7z
stackexchange/dsp/dsp.stackexchange.com.7z
stackexchange/raspberrypi/raspberrypi.stackexchange.com.7z
stackexchange/reverseengineering/reverseengineering.stackexchange.com.7z
stackexchange/scicomp/scicomp.stackexchange.com.7z
stackexchange/security/security.stackexchange.com.7z
stackexchange/serverfault/serverfault.com.7z
stackexchange/stackoverflow/stackoverflow.com-Posts.7z
stackexchange/stats/stats.stackexchange.com.7z
stackexchange/superuser/superuser.com.7z
stackexchange/unix/unix.stackexchange.com.7z
stackexchange/vi/vi.stackexchange.com.7z
stackexchange/wordpress/wordpress.stackexchange.com.7z
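The layout above follows one visible pattern: each archive sits in a folder named after the text before the first dot of its file name. A minimal sketch of arranging downloaded archives that way, and of checking which ones are still missing, might look like the following. The helper names (`target_path`, `arrange`, `missing_archives`) and the flat downloads folder are assumptions for illustration and are not part of codequestion.

```python
import shutil
from pathlib import Path

# Archive file names copied from the listing above.
ARCHIVES = [
    "ai.stackexchange.com.7z", "android.stackexchange.com.7z",
    "apple.stackexchange.com.7z", "arduino.stackexchange.com.7z",
    "askubuntu.com.7z", "avp.stackexchange.com.7z",
    "codereview.stackexchange.com.7z", "cs.stackexchange.com.7z",
    "datascience.stackexchange.com.7z", "dba.stackexchange.com.7z",
    "devops.stackexchange.com.7z", "dsp.stackexchange.com.7z",
    "raspberrypi.stackexchange.com.7z",
    "reverseengineering.stackexchange.com.7z",
    "scicomp.stackexchange.com.7z", "security.stackexchange.com.7z",
    "serverfault.com.7z", "stackoverflow.com-Posts.7z",
    "stats.stackexchange.com.7z", "superuser.com.7z",
    "unix.stackexchange.com.7z", "vi.stackexchange.com.7z",
    "wordpress.stackexchange.com.7z",
]


def target_path(root, filename):
    """Return <root>/<site>/<filename>, where <site> is the text before the first dot."""
    site = filename.split(".")[0]
    return Path(root) / site / filename


def arrange(downloads, root="stackexchange"):
    """Move archives found in a flat downloads folder (hypothetical) into the layout."""
    for name in ARCHIVES:
        source = Path(downloads) / name
        if source.exists():
            destination = target_path(root, name)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(destination))


def missing_archives(root="stackexchange"):
    """List archive names that are not yet in place under root."""
    return [name for name in ARCHIVES if not target_path(root, name).exists()]
```

Calling `arrange("downloads")` followed by `missing_archives()` gives a quick list of what still needs to be downloaded before running the ETL step.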

3.) Run the ETL process

python -m codequestion.etl.stackexchange.execute stackexchange

This will create the file ~/.codequestion/models/stackexchange/questions.db
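To confirm the ETL step produced something usable, a quick inspection along these lines may help. This is only a sketch and assumes questions.db is a SQLite file, which the extension suggests but this thread does not state. The `list_tables` helper is hypothetical, and no table names are assumed: they are read from the file itself.

```python
import sqlite3
from pathlib import Path

# Output path quoted in the step above.
DB_PATH = Path.home() / ".codequestion" / "models" / "stackexchange" / "questions.db"


def list_tables(path=DB_PATH):
    """Return {table name: row count} for a SQLite file (assumed format)."""
    path = Path(path)
    if not path.exists():
        # sqlite3.connect would silently create an empty file, so fail early.
        raise FileNotFoundError(path)

    connection = sqlite3.connect(str(path))
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        connection.close()
```

An empty result or a zero row count would suggest the ETL run did not pick up the archives.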

4.) Build word vectors

Currently, the model uses BM25 + fastText for indexing. Should this change, this step won't be necessary.

python -m codequestion.vectors

This will create the file ~/.codequestion/vectors/stackexchange-300d.magnitude

5.) Build index

python -m codequestion.index

After this step, the index is created and all necessary files are ready to query.
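Steps 3 to 5 can be chained in a small driver script. The sketch below simply runs the three commands quoted above in order and stops at the first failure. The function names are hypothetical, and the output paths are the two files mentioned in this thread (the location of the index files is not stated here, so it is not checked).

```python
import subprocess
import sys
from pathlib import Path


def build_commands(source="stackexchange"):
    """Return the three commands from steps 3-5, using the current interpreter."""
    python = sys.executable
    return [
        [python, "-m", "codequestion.etl.stackexchange.execute", source],
        [python, "-m", "codequestion.vectors"],
        [python, "-m", "codequestion.index"],
    ]


def expected_outputs(home=None):
    """Return the output files named in steps 3 and 4."""
    base = (Path(home) if home else Path.home()) / ".codequestion"
    return [
        base / "models" / "stackexchange" / "questions.db",
        base / "vectors" / "stackexchange-300d.magnitude",
    ]


def run_pipeline(source="stackexchange"):
    """Run each step in order; check=True raises if a step exits non-zero."""
    for command in build_commands(source):
        subprocess.run(command, check=True)

    # Report any expected file that was not produced.
    return [path for path in expected_outputs() if not path.exists()]
```

`run_pipeline()` returns an empty list when both files named in steps 3 and 4 exist afterwards.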

@SeekPoint
Author

it works!

nice job

@davidmezzetti davidmezzetti self-assigned this Oct 3, 2022
@davidmezzetti davidmezzetti added this to the v1.1.0 milestone Oct 3, 2022