Web scraping ETL #752

Open
iAMSagar44 opened this issue May 23, 2024 · 3 comments

Comments

@iAMSagar44
Contributor

Is there a feature in the pipeline to support web scraping functionality, similar to what the LangChain library offers (https://python.langchain.com/v0.1/docs/use_cases/web_scraping/)?

The idea is to load HTML pages from a web URL and transform them to text, before chunking the text and indexing it in a vector store.
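
For illustration only, the "HTML to text, then chunk" part of that flow could be sketched in plain Java as below. This is not Spring AI or LangChain code: the class `HtmlToChunks` and its methods are hypothetical names, the tag stripping is a crude regular-expression pass (a real pipeline would use an HTML parser and decode entities such as `&amp;`), and the chunks are measured in characters rather than tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library mentioned in this thread.
public class HtmlToChunks {

    // Crude HTML-to-text conversion; entities such as &amp; are left undecoded.
    public static String htmlToText(String html) {
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // drop script/style bodies
                .replaceAll("(?s)<[^>]+>", " ")                          // drop the remaining tags
                .replaceAll("\\s+", " ")                                 // collapse whitespace
                .trim();
    }

    // Fixed-size character chunks; a token-based splitter would count tokens instead.
    public static List<String> chunk(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxChars) {
            chunks.add(text.substring(start, Math.min(text.length(), start + maxChars)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        String text = htmlToText(html);
        System.out.println(text);           // prints "Title Hello world"
        System.out.println(chunk(text, 8)); // three chunks of at most 8 characters
    }
}
```

Each chunk would then be embedded and written to the vector store; that step depends on the store in use and is left out here.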

@iAMSagar44 changed the title from "Web scraping (url document loader functionality)" to "[Feature] Web scraping (url document loader functionality)" May 23, 2024
@ThomasVitale
Contributor

You can already load web pages into a vector database using the Tika DocumentReader, but it would be great to have dedicated support for the web scraping use case. For example, it would be great to be able to customise the loading and transformation/splitting of web pages in an HTML-aware way (similar to what LangChain and LlamaIndex support).

Dependency:

dependencies {
	...
	implementation 'org.springframework.ai:spring-ai-tika-document-reader'
}

Example:

public void run() throws MalformedURLException {
        List<Document> documents = new ArrayList<>();

        logger.info("Loading .html files as Documents");
        // The URL fragment (#_models) is not sent to the server, so the whole page is read
        var documentUri = URI.create("https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/concepts.html#_models");
        // UrlResource opens the page; Tika extracts its text into Documents
        var htmlReader = new TikaDocumentReader(new UrlResource(documentUri));
        documents.addAll(htmlReader.get());

        logger.info("Creating and storing Embeddings from Documents");
        // Split the text into token-based chunks before adding them to the vector store
        var textSplitter = new TokenTextSplitter();
        vectorStore.add(textSplitter.split(documents));

        // Sanity check: retrieve the three most similar chunks for a sample query
        var similarDocuments = vectorStore.similaritySearch(SearchRequest
                .query("Retrieval Augmented Generation")
                .withTopK(3)
                .withSimilarityThreshold(0.75));
        similarDocuments.forEach(doc -> System.out.println(doc.getContent()));
}
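
As a rough illustration of what "HTML-aware" splitting could mean beyond the Tika example above, the sketch below splits a page at its `<h1>`..`<h6>` headings so that each chunk keeps the heading it belongs to. It is not an existing Spring AI, LangChain, or LlamaIndex API: `HeadingSplitter` and `Section` are hypothetical names, the parsing is regex-based rather than a real HTML parser, and the record syntax assumes Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of heading-based splitting; not a library class.
public class HeadingSplitter {

    /** One section of a page: the heading text and the plain text that follows it. */
    public record Section(String heading, String body) {}

    // Matches <h1>..</h1> through <h6>..</h6>, case-insensitively, across line breaks.
    private static final Pattern HEADING =
            Pattern.compile("(?is)<h([1-6])[^>]*>(.*?)</h\\1>");

    private static String stripTags(String html) {
        return html.replaceAll("(?s)<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    public static List<Section> split(String html) {
        List<Section> sections = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        String heading = ""; // text before the first heading gets an empty heading
        int bodyStart = 0;
        while (m.find()) {
            addIfNotEmpty(sections, heading, html.substring(bodyStart, m.start()));
            heading = stripTags(m.group(2));
            bodyStart = m.end();
        }
        addIfNotEmpty(sections, heading, html.substring(bodyStart));
        return sections;
    }

    // Skips sections that have neither a heading nor any body text.
    private static void addIfNotEmpty(List<Section> out, String heading, String bodyHtml) {
        String body = stripTags(bodyHtml);
        if (!heading.isEmpty() || !body.isEmpty()) {
            out.add(new Section(heading, body));
        }
    }

    public static void main(String[] args) {
        String html = "<h1>Models</h1><p>AI models are algorithms.</p>"
                + "<h2>Prompts</h2><p>Prompts guide the model.</p>";
        split(html).forEach(s -> System.out.println(s.heading() + " -> " + s.body()));
    }
}
```

The heading could then be stored as metadata on each chunk, so a similarity search result shows which part of the page it came from.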

@sivaprasadreddy

sivaprasadreddy commented Jun 29, 2024

There are some commons-compress version incompatibilities. I had to exclude and configure it as follows:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-compress</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.26.1</version>
</dependency>

@markpollack
Member

markpollack commented Jul 22, 2024

We can include these changes in the pom.

How much more dedicated support beyond Tika is expected? The sample code reads well to me.

@markpollack added this to the 1.0.0-M2 milestone Jul 22, 2024
@markpollack self-assigned this Jul 22, 2024
@markpollack changed the title from "[Feature] Web scraping (url document loader functionality)" to "Web scraping ETL" Jul 22, 2024
@markpollack modified the milestones: 1.0.0-M2, 1.0.0-RC1 Sep 4, 2024
@csterwa removed this from the 1.0.0-RC1 milestone Sep 10, 2024