feat(ee): Add a service to fetch website content and prepare a persona of Captain Assistant (#12732)

This PR is the first of several that simplify the process of building an
assistant. The new flow requires only the URL of the user's website. We'll
automatically crawl it, identify the business name and what the business
does, and then generate a suggested assistant persona, complete with a
proposed name and description.

The service returns results like the following.
Example: tooljet.com
<img width="795" height="217" alt="Screenshot 2025-10-25 at 2 55 04 PM"
src="https://github.com/user-attachments/assets/9cb3594a-9c9c-4970-a0a1-4c9c8869c193"
/>

Example: replit.com
<img width="797" height="176" alt="Screenshot 2025-10-25 at 2 56 42 PM"
src="https://github.com/user-attachments/assets/6a1b4266-aab6-455f-a5e3-696d3a8243c9"
/>
This commit is contained in:
Pranav
2025-10-25 15:50:50 -07:00
committed by GitHub
parent b9864fe1f6
commit 5891fd6f49
5 changed files with 310 additions and 0 deletions

@@ -19,6 +19,20 @@ class Captain::Tools::SimplePageCrawlService
    ReverseMarkdown.convert @doc.at_xpath('//body'), unknown_tags: :bypass, github_flavored: true
  end

  def meta_description
    meta_desc = @doc.at_css('meta[name="description"]')
    return nil unless meta_desc && meta_desc['content']

    meta_desc['content'].strip
  end

  def favicon_url
    favicon_link = @doc.at_css('link[rel*="icon"]')
    return nil unless favicon_link && favicon_link['href']

    resolve_url(favicon_link['href'])
  end

  private

  def sitemap?
@@ -35,4 +49,12 @@ class Captain::Tools::SimplePageCrawlService
      absolute_url
    end
  end

  def resolve_url(url)
    return url if url.start_with?('http')

    URI.join(@external_link, url).to_s
  rescue StandardError
    url
  end
end
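
To illustrate how the favicon `href` resolution in this diff behaves, here is a minimal standalone sketch using only Ruby's standard `uri` library. It is not the service itself: the free function `resolve_url(base, url)` is a hypothetical variant of the private `resolve_url` helper that takes the page URL as an argument instead of reading `@external_link`, and `https://example.com/docs/page` is an assumed page URL chosen for illustration.

```ruby
require 'uri'

# Hypothetical standalone variant of the private resolve_url helper above:
# the base URL is passed in explicitly instead of read from @external_link.
def resolve_url(base, url)
  # Treat anything starting with "http" as already absolute, as the diff does.
  return url if url.start_with?('http')

  # URI.join resolves relative and protocol-relative references against base.
  URI.join(base, url).to_s
rescue StandardError
  # Fall back to the raw href when it cannot be parsed or joined.
  url
end

base = 'https://example.com/docs/page' # assumed page URL, illustration only

[
  '/favicon.ico',                     # root-relative href
  'favicon.ico',                      # href relative to the current path
  '//cdn.example.com/icon.png',       # protocol-relative href
  'https://static.example.com/i.png', # already absolute
  'bad icon.png'                      # unparseable href (contains a space)
].each do |href|
  puts "#{href.inspect} => #{resolve_url(base, href).inspect}"
end
```

One consequence of the prefix check worth noting: because it tests for `'http'` rather than a full scheme such as `'http://'` or `'https://'`, a relative path that merely begins with those letters (for example `httpbin-logo.png`) would be returned unchanged rather than joined against the page URL.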