Web scraping ETL #752
You can already load web pages into a vector database using the Tika dependency:

    dependencies {
        ...
        implementation 'org.springframework.ai:spring-ai-tika-document-reader'
    }

Example:

    public void run() throws MalformedURLException {
        List<Document> documents = new ArrayList<>();

        logger.info("Loading .html files as Documents");
        var documentUri = URI.create("https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/concepts.html#_models");
        var htmlReader = new TikaDocumentReader(new UrlResource(documentUri));
        documents.addAll(htmlReader.get());

        logger.info("Creating and storing Embeddings from Documents");
        var textSplitter = new TokenTextSplitter();
        vectorStore.add(textSplitter.split(documents));

        var similarDocuments = vectorStore.similaritySearch(SearchRequest
                .query("Retrieval Augmented Generation")
                .withTopK(3)
                .withSimilarityThreshold(0.75));
        similarDocuments.forEach(doc -> System.out.println(doc.getContent()));
    }
There are some commons-compress version incompatibilities. I had to exclude and configure it as follows:
We can include these changes in the pom. How much dedicated support beyond Tika is expected? The sample code reads well to me.
Is there a feature in the pipeline to support web scraping functionality, similar to what the LangChain library offers (https://python.langchain.com/v0.1/docs/use_cases/web_scraping/)?
The idea is basically to load HTML pages from a web URL and transform them to text before chunking and indexing them into a vector store.
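To make the requested steps concrete, here is a minimal plain-Java sketch of the load, transform, and chunk stages. It is only an illustration under stated assumptions, not Spring AI's or LangChain's API: the class `HtmlPipelineSketch` and its methods `fetch`, `htmlToText`, and `chunk` are hypothetical names, the tag stripping is a naive regex rather than a real HTML parser, and the chunker splits on word count rather than tokens. Indexing into a vector store is left out.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from Spring AI or LangChain.
public class HtmlPipelineSketch {

    // Load: fetch the raw HTML of a page (Java 11+ HttpClient).
    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    // Transform: naive HTML-to-text. Drops <script>/<style> blocks, strips the
    // remaining tags, and collapses whitespace. HTML entities are not decoded;
    // a real pipeline would use a proper parser such as Tika.
    static String htmlToText(String html) {
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        String noTags = noScripts.replaceAll("(?s)<[^>]+>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    // Chunk: split text into pieces of at most maxWords words each.
    static List<String> chunk(String text, int maxWords) {
        List<String> chunks = new ArrayList<>();
        if (text.isEmpty()) {
            return chunks;
        }
        String[] words = text.split(" ");
        for (int i = 0; i < words.length; i += maxWords) {
            int end = Math.min(i + maxWords, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Uses an inline HTML string so the example runs without network access;
        // fetch(url) could supply the HTML instead.
        String html = "<html><head><style>p{}</style></head>"
                + "<body><h1>Title</h1><p>Hello   web scraping world</p></body></html>";
        String text = htmlToText(html);
        chunk(text, 3).forEach(System.out::println);
    }
}
```

Each chunk produced this way could then be wrapped in a document object and added to a vector store, which is the step the question is asking the library to support end to end.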