feat: Add company backfill migration for existing contacts (Part 1) (#12657)

## Description

Implements company backfill migration infrastructure for existing
contacts. This is **Part 1 of 2** for the company model production
rollout as described in
[CW-5726](https://linear.app/chatwoot/issue/CW-5726/company-model-setting-it-up-on-production).

Creates jobs and services to associate existing contacts with companies
based on their email domains, filtering out free email providers (gmail,
yahoo, etc.) and disposable addresses.
 

**What's included:**
- Business email detector service with ValidEmail2 (uses
`disposable_domain?` to avoid DNS lookups)
- Per-account batch job to process contacts for one account
- Orchestrator job to iterate all accounts
- Rake task: `bundle exec rake companies:backfill`

~~*NOTE*: I'm using a hard-coded approach to determine if something is a
"business" email by filtering out emails that are usually personal. I've
also added domains that are common to some of our customers' regions.
This should be simpler. I looked into `Valid_Email2` and I couldn't find
anything to dictate whether an email is a personal email or a business
one. I don't think the approach used in the frontend is valid here.~~
UPDATE: Using `email_provider_info` gem instead.


**Pending - Part 2 (separate PR):** Real-time company creation for new
contacts

## Type of change

- [x] New feature (non-breaking change which adds functionality)

## How Has This Been Tested?

```bash
# Run all new tests
bundle exec rspec spec/enterprise/services/companies/business_email_detector_service_spec.rb \\
                   spec/enterprise/jobs/migration/company_account_batch_job_spec.rb \\
                   spec/enterprise/jobs/migration/company_backfill_job_spec.rb

# Run RuboCop
bundle exec rubocop enterprise/app/services/companies/business_email_detector_service.rb \\
                     enterprise/app/jobs/migration/company_account_batch_job.rb \\
                     enterprise/app/jobs/migration/company_backfill_job.rb \\
                     lib/tasks/companies.rake
```

**Performance optimization:**
- Uses `disposable_domain?` instead of `disposable?` to avoid DNS MX
lookups (discovered via tcpdump analysis - `disposable?` was making
network calls for every email, causing 100x slowdown)

## Checklist:

- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my code
- [x] I have commented on my code, particularly in hard-to-understand
areas
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] New and existing unit tests pass locally with my changes
- [ ] Any dependent changes have been merged and published in downstream
modules

---------

Co-authored-by: Sojan Jose <sojan@pepalo.com>
This commit is contained in:
Vinay Keerthi
2025-11-03 20:03:47 +05:30
committed by GitHub
parent e771d99552
commit ef54f07d5b
12 changed files with 376 additions and 4 deletions

View File

@@ -21,6 +21,7 @@ gem 'telephone_number'
gem 'time_diff'
gem 'tzinfo-data'
gem 'valid_email2'
gem 'email-provider-info'
# compress javascript config.assets.js_compressor
gem 'uglifier'
##-- used for single column multiple binary flags in notification settings/feature flagging --##

View File

@@ -270,6 +270,7 @@ GEM
concurrent-ruby (~> 1.0)
http (>= 3.0)
ruby2_keywords
email-provider-info (0.0.1)
email_reply_trimmer (0.1.13)
erubi (1.13.0)
et-orbi (1.2.11)
@@ -1016,6 +1017,7 @@ DEPENDENCIES
dotenv-rails (>= 3.0.0)
down
elastic-apm
email-provider-info
email_reply_trimmer
facebook-messenger
factory_bot_rails (>= 6.4.3)

View File

@@ -0,0 +1,16 @@
class AddUniqueIndexToCompaniesDomain < ActiveRecord::Migration[7.1]
def up
remove_index :companies, name: 'index_companies_on_domain_and_account_id', if_exists: true
add_index :companies, [:account_id, :domain],
unique: true,
name: 'index_companies_on_account_and_domain',
where: 'domain IS NOT NULL'
end
def down
remove_index :companies, name: 'index_companies_on_account_and_domain', if_exists: true
add_index :companies, [:domain, :account_id],
name: 'index_companies_on_domain_and_account_id'
end
end

View File

@@ -577,8 +577,8 @@ ActiveRecord::Schema[7.1].define(version: 2025_10_22_152158) do
t.bigint "account_id", null: false
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
t.index ["account_id", "domain"], name: "index_companies_on_account_and_domain", unique: true, where: "(domain IS NOT NULL)"
t.index ["account_id"], name: "index_companies_on_account_id"
t.index ["domain", "account_id"], name: "index_companies_on_domain_and_account_id"
t.index ["name", "account_id"], name: "index_companies_on_name_and_account_id"
end

View File

@@ -0,0 +1,54 @@
class Migration::CompanyAccountBatchJob < ApplicationJob
queue_as :low
def perform(account)
account.contacts
.where.not(email: nil)
.find_in_batches(batch_size: 1000) do |contact_batch|
process_contact_batch(contact_batch, account)
end
end
private
def process_contact_batch(contacts, account)
contacts.each do |contact|
next unless should_process?(contact)
company = find_or_create_company(contact, account)
# rubocop:disable Rails/SkipsModelValidations
contact.update_column(:company_id, company.id) if company
# rubocop:enable Rails/SkipsModelValidations
end
end
def should_process?(contact)
return false if contact.company_id.present?
return false if contact.email.blank?
Companies::BusinessEmailDetectorService.new(contact.email).perform
end
def find_or_create_company(contact, account)
domain = extract_domain(contact.email)
company_name = derive_company_name(contact, domain)
Company.find_or_create_by!(account: account, domain: domain) do |company|
company.name = company_name
end
rescue ActiveRecord::RecordNotUnique
# Race condition: Another job created it between our check and create
# just find the one that was created
Company.find_by(account: account, domain: domain)
end
def extract_domain(email)
email.split('@').last&.downcase
end
def derive_company_name(contact, domain)
contact.additional_attributes&.dig('company_name').presence ||
domain.split('.').first.tr('-_', ' ').titleize
end
end

View File

@@ -0,0 +1,17 @@
class Migration::CompanyBackfillJob < ApplicationJob
queue_as :low
def perform
Rails.logger.info 'Starting company backfill migration...'
account_count = 0
Account.find_in_batches(batch_size: 100) do |accounts|
accounts.each do |account|
Rails.logger.info "Enqueuing company backfill for account #{account.id}"
Migration::CompanyAccountBatchJob.perform_later(account)
account_count += 1
end
end
Rails.logger.info "Company backfill migration complete. Enqueued jobs for #{account_count} accounts."
end
end

View File

@@ -12,9 +12,9 @@
#
# Indexes
#
# index_companies_on_account_id (account_id)
# index_companies_on_domain_and_account_id (domain,account_id)
# index_companies_on_name_and_account_id (name,account_id)
# index_companies_on_account_and_domain (account_id,domain) UNIQUE WHERE (domain IS NOT NULL)
# index_companies_on_account_id (account_id)
# index_companies_on_name_and_account_id (name,account_id)
#
class Company < ApplicationRecord
include Avatarable
@@ -24,6 +24,7 @@ class Company < ApplicationRecord
with: /\A[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?)+\z/,
message: I18n.t('errors.companies.domain.invalid')
}
validates :domain, uniqueness: { scope: :account_id }, if: -> { domain.present? }
validates :description, length: { maximum: Limits::COMPANY_DESCRIPTION_LENGTH_LIMIT }
belongs_to :account

View File

@@ -0,0 +1,19 @@
class Companies::BusinessEmailDetectorService
attr_reader :email
def initialize(email)
@email = email
end
def perform
return false if email.blank?
address = ValidEmail2::Address.new(email)
return false unless address.valid?
return false if address.disposable_domain?
provider = EmailProviderInfo.call(email)
provider.nil?
end
end

12
lib/tasks/companies.rake Normal file
View File

@@ -0,0 +1,12 @@
namespace :companies do
desc 'Backfill companies from existing contact email domains'
task backfill: :environment do
puts 'Starting company backfill migration...'
puts 'This will process all accounts and create companies from contact email domains.'
puts 'The job will run in the background via Sidekiq'
puts ''
Migration::CompanyBackfillJob.perform_later
puts 'Company backfill job has been enqueued.'
puts 'Monitor progress in logs or Sidekiq dashboard.'
end
end

View File

@@ -0,0 +1,120 @@
require 'rails_helper'
RSpec.describe Migration::CompanyAccountBatchJob, type: :job do
let(:account) { create(:account) }
describe '#perform' do
before do
# Stub EmailProvideInfo to control behavior in tests
allow(EmailProviderInfo).to receive(:call) do |email|
domain = email.split('@').last&.downcase
case domain
when 'gmail.com', 'yahoo.com', 'hotmail.com', 'uol.com.br'
'free_provider' # generic free provider name
end
end
end
context 'when contact has business email' do
let!(:contact) { create(:contact, account: account, email: 'user@acme.com') }
it 'creates a company and associates the contact' do
expect do
described_class.perform_now(account)
end.to change(Company, :count).by(1)
contact.reload
expect(contact.company).to be_present
expect(contact.company.domain).to eq('acme.com')
expect(contact.company.name).to eq('Acme')
end
end
context 'when contact has free email' do
let!(:contact) { create(:contact, account: account, email: 'user@gmail.com') }
it 'does not create a company' do
expect do
described_class.perform_now(account)
end.not_to change(Company, :count)
contact.reload
expect(contact.company_id).to be_nil
end
end
context 'when contact has company_name in additional_attributes' do
let!(:contact) do
create(:contact, account: account, email: 'user@acme.com', additional_attributes: { 'company_name' => 'Acme Corporation' })
end
it 'uses the saved company name' do
described_class.perform_now(account)
contact.reload
expect(contact.company.name).to eq('Acme Corporation')
end
end
context 'when contact already has a company' do
let!(:existing_company) { create(:company, account: account, domain: 'existing.com') }
let!(:contact) do
create(:contact, account: account, email: 'user@acme.com', company: existing_company)
end
it 'does not change the existing company' do
described_class.perform_now(account)
contact.reload
expect(contact.company_id).to eq(existing_company.id)
end
end
context 'when multiple contacts have the same domain' do
let!(:contact1) { create(:contact, account: account, email: 'user1@acme.com') }
let!(:contact2) { create(:contact, account: account, email: 'user2@acme.com') }
it 'creates only one company for the domain' do
expect do
described_class.perform_now(account)
end.to change(Company, :count).by(1)
contact1.reload
contact2.reload
expect(contact1.company_id).to eq(contact2.company_id)
expect(contact1.company.domain).to eq('acme.com')
end
end
context 'when contact has no email' do
let!(:contact) { create(:contact, account: account, email: nil) }
it 'skips the contact' do
expect do
described_class.perform_now(account)
end.not_to change(Company, :count)
contact.reload
expect(contact.company_id).to be_nil
end
end
context 'when processing large batch' do
before do
contacts_data = Array.new(2000) do |i|
{
account_id: account.id,
email: "user#{i}@company#{i % 100}.com",
name: "User #{i}",
created_at: Time.current,
updated_at: Time.current
}
end
# rubocop:disable Rails/SkipsModelValidations
Contact.insert_all(contacts_data)
# rubocop:enable Rails/SkipsModelValidations
end
it 'processes all contacts in batches' do
expect do
described_class.perform_now(account)
end.to change(Company, :count).by(100)
expect(account.contacts.where.not(company_id: nil).count).to eq(2000)
end
end
end
end

View File

@@ -0,0 +1,31 @@
require 'rails_helper'
RSpec.describe Migration::CompanyBackfillJob, type: :job do
describe '#perform' do
it 'enqueues the job' do
expect { described_class.perform_later }
.to have_enqueued_job(described_class)
.on_queue('low')
end
context 'when accounts exist' do
let!(:account1) { create(:account) }
let!(:account2) { create(:account) }
it 'enqueues CompanyAccountBatchJob for each account' do
expect do
described_class.perform_now
end.to have_enqueued_job(Migration::CompanyAccountBatchJob)
.with(account1)
.and have_enqueued_job(Migration::CompanyAccountBatchJob)
.with(account2)
end
end
context 'when no accounts exist' do
it 'completes without error' do
expect { described_class.perform_now }.not_to raise_error
end
end
end
end

View File

@@ -0,0 +1,99 @@
require 'rails_helper'
RSpec.describe Companies::BusinessEmailDetectorService, type: :service do
let(:service) { described_class.new(email) }
describe '#perform' do
context 'when email is from a business domain' do
let(:email) { 'user@acme.com' }
let(:valid_email_address) { instance_double(ValidEmail2::Address, valid?: true, disposable_domain?: false) }
before do
allow(ValidEmail2::Address).to receive(:new).with(email).and_return(valid_email_address)
allow(EmailProviderInfo).to receive(:call).with(email).and_return(nil)
end
it 'returns true' do
expect(service.perform).to be(true)
end
end
context 'when email is from gmail' do
let(:email) { 'user@gmail.com' }
let(:valid_email_address) { instance_double(ValidEmail2::Address, valid?: true, disposable_domain?: false) }
before do
allow(ValidEmail2::Address).to receive(:new).with(email).and_return(valid_email_address)
allow(EmailProviderInfo).to receive(:call).with(email).and_return('gmail')
end
it 'returns false' do
expect(service.perform).to be(false)
end
end
context 'when email is from Brazilian free provider' do
let(:email) { 'user@uol.com.br' }
let(:valid_email_address) { instance_double(ValidEmail2::Address, valid?: true, disposable_domain?: false) }
before do
allow(ValidEmail2::Address).to receive(:new).with(email).and_return(valid_email_address)
allow(EmailProviderInfo).to receive(:call).with(email).and_return('uol')
end
it 'returns false' do
expect(service.perform).to be(false)
end
end
context 'when email is disposable' do
let(:email) { 'user@mailinator.com' }
let(:disposable_email_address) { instance_double(ValidEmail2::Address, valid?: true, disposable_domain?: true) }
it 'returns false' do
allow(ValidEmail2::Address).to receive(:new).with(email).and_return(disposable_email_address)
expect(service.perform).to be(false)
end
end
context 'when email is invalid format' do
let(:email) { 'invalid-email' }
let(:invalid_email_address) { instance_double(ValidEmail2::Address, valid?: false) }
it 'returns false' do
allow(ValidEmail2::Address).to receive(:new).with(email).and_return(invalid_email_address)
expect(service.perform).to be(false)
end
end
context 'when email is nil' do
let(:email) { nil }
it 'remains false' do
expect(service.perform).to be(false)
end
end
context 'when email is empty string' do
let(:email) { '' }
it 'returns false' do
expect(service.perform).to be(false)
end
end
context 'when email domain is uppercase' do
let(:email) { 'user@GMAIL.COM' }
let(:valid_email_address) { instance_double(ValidEmail2::Address, valid?: true, disposable_domain?: false) }
before do
allow(ValidEmail2::Address).to receive(:new).with(email).and_return(valid_email_address)
allow(EmailProviderInfo).to receive(:call).with(email).and_return('gmail')
end
it 'returns false (case insensitive)' do
expect(service.perform).to be(false)
end
end
end
end