Web scraping ETL #752

iAMSagar44 · 2024-05-23T02:39:57Z

Is there a feature in the pipeline to support web scraping, similar to what the LangChain library offers (https://python.langchain.com/v0.1/docs/use_cases/web_scraping/)?

The idea is to load HTML pages from a web URL and transform them to text before chunking and indexing them in a vector store.

ThomasVitale · 2024-05-23T08:03:25Z

You can already load web pages into a vector database using the Tika DocumentReader, but it would be great to have dedicated support for the web scraping use case. For example, it would be useful to be able to customise the loading and transformation/splitting of web pages in an HTML-aware way (similar to what LangChain and LlamaIndex support).

Dependency:

dependencies {
	...
	implementation 'org.springframework.ai:spring-ai-tika-document-reader'
}

Example:

public void run() throws MalformedURLException {
    List<Document> documents = new ArrayList<>();

    // Fetch the page over HTTP and let Tika extract its text content.
    logger.info("Loading .html files as Documents");
    var documentUri = URI.create("https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/concepts.html#_models");
    var htmlReader = new TikaDocumentReader(new UrlResource(documentUri));
    documents.addAll(htmlReader.get());

    // Split the extracted text into token-sized chunks, then embed and store them.
    logger.info("Creating and storing Embeddings from Documents");
    var textSplitter = new TokenTextSplitter();
    vectorStore.add(textSplitter.split(documents));

    // Query the store to check that the page content is retrievable.
    var similarDocuments = vectorStore.similaritySearch(SearchRequest
            .query("Retrieval Augmented Generation")
            .withTopK(3)
            .withSimilarityThreshold(0.75));
    similarDocuments.forEach(doc -> System.out.println(doc.getContent()));
}
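
To make "HTML-aware" splitting more concrete, here is a minimal sketch of the idea: cut the page at heading tags so that each chunk carries its own section title, instead of splitting the flattened text by token count alone. This is only an illustration and not an existing Spring AI API. The names HtmlSectionSplitter, Section, split and toText are hypothetical, and the regex-based tag handling is an assumption made to keep the example dependency-free; a real implementation would more likely rely on a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Spring AI): splits an HTML string into heading-delimited sections. */
final class HtmlSectionSplitter {

    /** A heading and the plain text that follows it, up to the next heading. */
    record Section(String heading, String text) {}

    // Matches <h1>...</h1> through <h6>...</h6>, case-insensitively and across line breaks.
    private static final Pattern HEADING = Pattern.compile(
            "<h([1-6])\\b[^>]*>(.*?)</h\\1\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Script and style bodies are not readable text, so they are dropped entirely.
    private static final Pattern NON_TEXT = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one Section per heading, plus a leading untitled Section if text precedes the first heading. */
    static List<Section> split(String html) {
        String cleaned = NON_TEXT.matcher(html).replaceAll(" ");
        List<Section> sections = new ArrayList<>();
        Matcher matcher = HEADING.matcher(cleaned);
        String heading = "";  // content before the first heading has no title
        int bodyStart = 0;
        while (matcher.find()) {
            addIfNotEmpty(sections, heading, cleaned.substring(bodyStart, matcher.start()));
            heading = toText(matcher.group(2));
            bodyStart = matcher.end();
        }
        addIfNotEmpty(sections, heading, cleaned.substring(bodyStart));
        return sections;
    }

    private static void addIfNotEmpty(List<Section> out, String heading, String htmlBody) {
        String text = toText(htmlBody);
        if (!heading.isEmpty() || !text.isEmpty()) {
            out.add(new Section(heading, text));
        }
    }

    /** Crude tag stripping: replaces tags with spaces, unescapes a few entities, collapses whitespace. */
    static String toText(String html) {
        return html.replaceAll("<[^>]+>", " ")
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&")  // unescaped last so "&amp;lt;" becomes "&lt;", not "<"
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

Each Section returned by HtmlSectionSplitter.split(html) could then be turned into a Document with the heading stored as metadata before it goes through the text splitter and into the vector store, so that search results keep a pointer to the part of the page they came from.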

sivaprasadreddy · 2024-06-29T04:05:54Z

There are some commons-compress version incompatibilities. I had to exclude and configure it as follows:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-compress</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.26.1</version>
</dependency>

markpollack · 2024-07-22T22:20:46Z

We can include these changes in the pom.

How much more dedicated support beyond Tika is expected? The sample code reads well to me.

iAMSagar44 changed the title ~~Web scraping (url document loader functionality)~~ [Feature] Web scraping (url document loader functionality) May 23, 2024

markpollack added this to the 1.0.0-M2 milestone Jul 22, 2024

markpollack self-assigned this Jul 22, 2024

markpollack changed the title ~~[Feature] Web scraping (url document loader functionality)~~ Web scraping ETL Jul 22, 2024

markpollack added the ETL label Jul 22, 2024

youngmoneee mentioned this issue Aug 14, 2024

Limitations of ETL Pipelines and Design Proposals for Improvement #1219

Open

markpollack modified the milestones: 1.0.0-M2, 1.0.0-RC1 Sep 4, 2024

csterwa added the next priorities label Sep 10, 2024

csterwa removed this from the 1.0.0-RC1 milestone Sep 10, 2024
