ballroom

YouTube video organization with AI categorization

Jun 1, 2025 · Coming Soon

Ballroom is a YouTube video organizer. Connect your Google account, sync your liked videos, and it uses AI to automatically categorize them into folders. The problem it solves is simple: YouTube's liked videos list is a flat, unsearchable timeline. After a few hundred likes, it becomes useless for finding anything. Ballroom gives you categories, search, and unlike tracking.

Chain-of-thought categorization over embeddings

The AI categorization uses Gemini 2.5 Flash Lite via the Vercel AI SDK with a chain-of-thought prompt. The model first identifies the content type (entertainment, educational, music, news), then assigns a category from the user's list. This two-step reasoning matters because ambiguous content is common — a Bill Burr rant about the gym is entertainment, not fitness. A single-shot prompt gets this wrong surprisingly often. The chain-of-thought approach forces the model to reason about what the video is before deciding where it goes.

const categorizationSchema = z.object({
  categorizations: z.array(
    z.object({
      videoId: z.string(),
      contentType: z.enum([
        'entertainment', 'educational', 'music', 'news', 'other',
      ]).describe('Primary content type/format of this video'),
      reasoning: z.string().describe('Brief explanation of why'),
      categoryId: z.string().describe('Final category ID assignment'),
    })
  ),
});

// The prompt enforces chain-of-thought:
// 1. Identify content TYPE (entertainment, educational, music, ...)
// 2. THEN assign a category
//
// Priority hierarchy:
//   Creator/Channel type > Primary purpose > Topic mentioned
//
// Examples in the prompt:
//   Bill Burr rant about gym → Entertainment, NOT Fitness
//   AthleanX workout tutorial → Educational → Fitness
//   MKBHD phone review → Educational → Tech

The priority hierarchy is explicit in the prompt: creator/channel type trumps primary purpose, which trumps topic mentioned. The model returns structured output (Zod-validated), and any video where the AI returns an invalid category ID stays uncategorized rather than being assigned to a phantom category. Videos are categorized in batches of 10 per API call to keep token costs reasonable.
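The batching and invalid-ID fallback described above can be sketched as two pure helpers (the names `chunk` and `resolveCategoryId` are illustrative, not from the codebase):

```typescript
// Split liked videos into batches of 10 per categorization API call
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// A categoryId the model invented maps to null (uncategorized)
// rather than a phantom category
function resolveCategoryId(
  categoryId: string,
  validCategoryIds: Set<string>
): string | null {
  return validCategoryIds.has(categoryId) ? categoryId : null;
}
```

Validating against the user's actual category list is what keeps a hallucinated ID from ever reaching the database.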

Progressive sync strategy

There are three sync modes that share the same progressive sync algorithm with different depth limits. Full sync (up to 10k videos) runs on onboarding to pull in the user's entire library. Quick sync (50 videos) runs hourly via a Trigger.dev cron to catch new likes. Extended sync (500 videos) is manual, for when a user wants a deeper refresh.

async function progressiveSync(
  userId: string,
  options: ProgressiveSyncOptions = {}
): Promise<SyncResult> {
  const opts = { initialLimit: 50, maxDepth: 10_000, ...options };
  const batchSize = 50;
  const consecutiveThreshold = 2;
  const syncStartedAt = new Date();
  let totalFetched = 0;
  let consecutiveExistingBatches = 0;
  let pageToken: string | undefined;
  let reachedEnd = false;

  while (totalFetched < opts.maxDepth) {
    const { videos: fetchedVideos, nextPageToken } =
      await fetchLikedVideos(userId, batchSize, pageToken);

    if (fetchedVideos.length === 0) {
      reachedEnd = true;
      break;
    }

    const insertedCount = await syncBatchToDatabase({
      userId,
      fetchedVideos,
      syncStartedAt,
    });

    // A batch that inserted nothing was entirely already-known videos
    consecutiveExistingBatches =
      insertedCount === 0 ? consecutiveExistingBatches + 1 : 0;

    totalFetched += fetchedVideos.length;
    pageToken = nextPageToken;

    if (!nextPageToken) {
      reachedEnd = true;
      break;
    }

    // Smart stopping: if we've passed initialLimit
    // and hit 2+ consecutive batches of already-known videos, stop
    if (
      totalFetched >= opts.initialLimit &&
      consecutiveExistingBatches >= consecutiveThreshold
    ) {
      break;
    }
  }

  // If we reached the end, mark unseen videos as unliked
  if (reachedEnd) {
    await markVideosAsUnliked(userId, syncStartedAt);
  }

  return { totalFetched, reachedEnd };
}

// Three sync modes: same algorithm, different depth limits
export const quickSync    = (userId: string) => progressiveSync(userId, { initialLimit: 50 });
export const extendedSync = (userId: string) => progressiveSync(userId, { initialLimit: 500 });
export const fullSync     = (userId: string) => progressiveSync(userId, { initialLimit: 10_000 });

The smart stopping logic is the key optimization: quick sync stops early if it finds 2 or more consecutive batches of already-known videos. There's no point fetching your entire history every hour when you only liked 3 new videos since the last sync. Upsert uses onConflictDoNothing() on the (userId, youtubeId) unique constraint to handle duplicates across paginated API responses.

Atomic quota reservation

Before any sync or AI call, the system atomically reserves quota using a DB transaction with row-level locking. The trick is putting the quota check in the WHERE clause of the UPDATE — not as a separate SELECT. This means two concurrent requests can't both pass the "enough quota?" check and then both decrement. If the WHERE doesn't match (quota would be exceeded), the update returns zero rows and the code throws immediately.

export async function reserveQuota(
  userId: string,
  quotaType: QuotaType,
  amount: number
): Promise<{ reserved: number; newUsed: number; limit: number }> {
  return db.transaction(async (tx) => {
    // Step 1: Auto-reset quota if past the reset date (1st of month, UTC)
    const now = new Date();
    const nextResetDate = new Date(
      Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 1)
    );
    await tx
      .update(user)
      .set(buildQuotaResetValues(nextResetDate))
      .where(buildQuotaResetNeededCondition(userId));

    // Step 2: Atomic increment with check in WHERE clause
    // Row-level lock prevents two concurrent requests
    // from both passing the "enough quota?" check
    const update = await tx
      .update(user)
      .set(buildQuotaIncrementUpdate(quotaType, amount))
      .where(
        and(
          eq(user.id, userId),
          // This is the key: check happens atomically with the update
          sql`${user.syncQuotaUsed} + ${amount} <= ${user.syncQuotaLimit}`
        )
      )
      .returning({ used: user.syncQuotaUsed, limit: user.syncQuotaLimit });

    if (update.length === 1) {
      return { reserved: amount, newUsed: update[0].used, limit: update[0].limit };
    }

    // If WHERE didn't match, quota would be exceeded
    throw new AppError({ code: 'QUOTA_EXCEEDED', message: 'Resets monthly.' });
  });
}

Once reserved, quota is consumed even if the downstream AI call fails. I considered a "refund on failure" approach but it adds complexity for little gain — failed calls are rare, and quota resets monthly anyway. The reset happens automatically: the first reservation after the 1st of each month (UTC) resets the counters before checking.
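The reset-date arithmetic is small enough to sketch in full. These helper names (`nextQuotaResetDate`, `quotaResetNeeded`) are illustrative; the real code expresses the second check as a Drizzle WHERE condition:

```typescript
// Next quota reset: 00:00 UTC on the 1st of the following month.
// Date.UTC handles the December-to-January rollover automatically.
function nextQuotaResetDate(from: Date): Date {
  return new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth() + 1, 1));
}

// A reset is needed when no reset date is stored yet,
// or the stored reset date is already in the past
function quotaResetNeeded(quotaResetAt: Date | null, now: Date): boolean {
  return quotaResetAt === null || quotaResetAt <= now;
}
```

Because the reset runs inside the same transaction as the reservation, a user who crosses the month boundary never sees a stale counter.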

Unlike tracking via lastSeenAt

The YouTube API doesn't have a "recently unliked" endpoint. Instead, I track unlikes implicitly. Every video fetched during a sync gets its lastSeenAt updated to the sync start time. When a full sync completes (no more pages from the API), any video whose lastSeenAt is older than the sync start was not in the user's liked videos — they unliked it between syncs.

// After reaching the end of YouTube's API results (no more pages),
// any video whose lastSeenAt is older than syncStartedAt was not
// in the user's liked videos anymore — they unliked it.
async function markVideosAsUnliked(userId: string, syncStartedAt: Date) {
  await db
    .update(videos)
    .set({ syncStatus: 'unliked' })
    .where(
      and(
        eq(videos.userId, userId),
        eq(videos.syncStatus, 'active'),
        or(
          isNull(videos.lastSeenAt),
          lt(videos.lastSeenAt, syncStartedAt)
        )
      )
    );
}

// During sync, every fetched video gets its lastSeenAt updated:
await tx
  .update(videos)
  .set({
    lastSeenAt: sql`COALESCE(GREATEST(${videos.lastSeenAt}, ${syncStartedAt}), ${syncStartedAt})`,
    syncStatus: 'active',
  })
  .where(and(eq(videos.userId, userId), inArray(videos.youtubeId, youtubeIds)));

This only runs on full syncs that reach the end of the API results. Quick syncs that stop early (via the consecutive-batch heuristic) don't mark anything as unliked, since they haven't seen the full library.

PostgreSQL full-text search over Elasticsearch

Search uses a GIN index on a weighted tsvector: title gets weight A (highest), description gets B, channel name gets C. The 'simple' text search config avoids stemming, which matters because YouTube content is full of brand names, abbreviations, and mixed-language text that stemmers mangle. Prefix matching with :* means typing "bil" matches "bill burr" immediately.

// Weighted tsvector — title matters most, channel name least
export function createVideoSearchVector(
  title: AnyPgColumn,
  description: AnyPgColumn,
  channelName: AnyPgColumn
): SQL {
  return sql`(
    setweight(to_tsvector('simple', COALESCE(${title}, '')), 'A') ||
    setweight(to_tsvector('simple', COALESCE(${description}, '')), 'B') ||
    setweight(to_tsvector('simple', COALESCE(${channelName}, '')), 'C')
  )`;
}

// GIN index on the computed vector
index('idx_videos_search_vector').using(
  'gin',
  createVideoSearchVector(table.title, table.description, table.channelName)
)

// Query with prefix matching — "bil" matches "bill burr"
// Uses 'simple' config: no stemming, preserves brand names and mixed-language text
.where(
  sql`${searchVector} @@ to_tsquery('simple', ${term + ':*'})`
)

The alternative was Elasticsearch or a vector search service, but those add infrastructure complexity for what's a personal-scale library. A user's liked videos might be a few thousand rows at most — PostgreSQL handles that without breaking a sweat.

Background jobs with Trigger.dev

Syncs run as background jobs orchestrated by Trigger.dev. The critical detail is the concurrencyKey set to the user's ID. This creates a per-user queue partition, ensuring only one sync job runs per user at a time. Without this, two concurrent syncs for the same user would corrupt each other's lastSeenAt timestamps, breaking unlike detection.

// Per-user concurrency: each user gets their own queue partition
export async function triggerInitialSync(userId: string) {
  return initialSyncTask.trigger(
    { userId },
    { concurrencyKey: userId }
  );
}

// Hourly cron processes all onboarded users in batches
export const hourlySyncSchedule = schedules.task({
  id: 'hourly-sync',
  cron: '0 * * * *',
  run: async () => {
    let lastUserId: string | undefined;

    while (true) {
      const userBatch = await db
        .select({ id: user.id })
        .from(user)
        .where(
          lastUserId
            ? and(isNotNull(user.onboardedAt), gt(user.id, lastUserId))
            : isNotNull(user.onboardedAt)
        )
        .orderBy(asc(user.id))
        .limit(500);

      if (userBatch.length === 0) break;

      await batchTriggerIncrementalSync(
        userBatch.map((u) => u.id)
      );

      lastUserId = userBatch[userBatch.length - 1]?.id;
    }
  },
});

The hourly cron processes users in batches of 500, paginating by ID to avoid loading all users into memory. Each batch uses batchTrigger to enqueue all syncs at once. Retries are built in: 3 attempts with exponential backoff (1-10s, randomized) per task.
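The retry policy is declared on the task itself. A sketch of what that looks like with the Trigger.dev v3 SDK; the task id and payload shape here are assumptions, and the exact option names should be checked against the Trigger.dev docs:

```typescript
import { task } from '@trigger.dev/sdk/v3';

export const incrementalSyncTask = task({
  id: 'incremental-sync',
  retry: {
    maxAttempts: 3,         // 3 attempts total
    minTimeoutInMs: 1_000,  // first backoff around 1s
    maxTimeoutInMs: 10_000, // capped at 10s
    factor: 2,              // exponential growth
    randomize: true,        // jitter to avoid thundering herds
  },
  run: async (payload: { userId: string }) => {
    await progressiveSync(payload.userId, { initialLimit: 50 });
  },
});
```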

The details

YouTube API quota is 10k units/day shared across all users. The code distinguishes between 403 quota errors and 429 rate limits — different causes, different recovery strategies. Exponential backoff (1-30s) handles transient rate limits. If the daily quota is truly exhausted, syncs are deferred until the next day.
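The 403-vs-429 distinction can be sketched as a pure classifier (the function and type names are illustrative; the YouTube Data API reports daily quota exhaustion as a 403 with reason quotaExceeded):

```typescript
type QuotaErrorKind = 'daily-quota-exhausted' | 'rate-limited' | 'other';

// 403 quotaExceeded: the shared daily budget is gone, defer to tomorrow.
// 429: transient rate limit, retry with backoff.
function classifyYouTubeError(status: number, reason?: string): QuotaErrorKind {
  if (status === 403 && reason === 'quotaExceeded') return 'daily-quota-exhausted';
  if (status === 429) return 'rate-limited';
  return 'other';
}

// Exponential backoff for transient rate limits, capped at 30s
function backoffMs(attempt: number, baseMs = 1_000, maxMs = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

Only the rate-limited branch is worth retrying in-process; an exhausted daily quota won't recover before the next UTC day.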

Token refresh is handled automatically by Google's OAuth2Client. A tokens event listener persists the new access token to the database with retry logic (3 attempts, exponential backoff). If refresh fails entirely — revoked token, expired refresh token — the user sees a re-authentication prompt.

// acc: the user's stored OAuth account row, loaded from the database
const oauth2Client = new OAuth2Client(
  env.GOOGLE_CLIENT_ID,
  env.GOOGLE_CLIENT_SECRET
);

oauth2Client.setCredentials({
  access_token: acc.accessToken,
  refresh_token: acc.refreshToken,
  expiry_date: acc.accessTokenExpiresAt?.getTime(),
});

// google-auth-library auto-refreshes expired tokens
// This listener persists the new token to the database
oauth2Client.on('tokens', async (tokens) => {
  if (!tokens.access_token) return;

  // Retry with exponential backoff — 100ms, 200ms, 400ms
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      await db
        .update(account)
        .set({
          accessToken: tokens.access_token,
          accessTokenExpiresAt: tokens.expiry_date
            ? new Date(tokens.expiry_date)
            : undefined,
        })
        .where(eq(account.id, acc.id));
      return;
    } catch (error) {
      if (attempt < 2) {
        await new Promise((r) => setTimeout(r, 100 * Math.pow(2, attempt)));
      }
    }
  }
  // If all retries fail, user may need to re-authenticate
});

Category changes trigger re-analysis. If a user renames a category or adds a new one, all videos where lastAnalyzedAt is older than the latest category update get re-categorized. This avoids stale categorizations where videos are assigned to categories that no longer make sense or where a better-fit category now exists.

// When a user renames or adds categories, all videos categorized
// before that change need to be re-analyzed against the new category list
const videosToAnalyze = await db
  .select()
  .from(videos)
  .where(
    and(
      eq(videos.userId, userId),
      eq(videos.syncStatus, 'active'),
      or(
        isNull(videos.lastAnalyzedAt),
        lt(videos.lastAnalyzedAt, latestCategoryUpdate)
      )
    )
  );

Auth scopes: Better-Auth requests the youtube.readonly scope plus offline access for refresh tokens. Setting prompt: 'consent' ensures users always see what they're granting, even if they've signed in before; that matters because YouTube access is more sensitive than typical OAuth scopes.
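A sketch of what that looks like in Better-Auth's Google provider options (option names here follow the Better-Auth docs as I understand them and should be verified against the current API):

```typescript
import { betterAuth } from 'better-auth';

export const auth = betterAuth({
  socialProviders: {
    google: {
      clientId: process.env.GOOGLE_CLIENT_ID!,
      clientSecret: process.env.GOOGLE_CLIENT_SECRET!,
      scope: ['https://www.googleapis.com/auth/youtube.readonly'],
      // offline access so Google issues a refresh token
      accessType: 'offline',
      // always show the consent screen, even for returning users
      prompt: 'consent',
    },
  },
});
```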

IDs with prefixes: All database IDs use nanoid with a prefix — vid_abc123, cat_xyz789. This makes debugging significantly easier. When you see an ID in logs or error messages, you immediately know what kind of entity it refers to.
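A sketch of prefixed ID generation. The app uses the nanoid package; this version hand-rolls a nanoid-style generator on node:crypto so the example is self-contained:

```typescript
import { randomBytes } from 'node:crypto';

const ALPHABET =
  '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Generate an ID like vid_h3kP9xQw2mTz: entity prefix plus random suffix.
// (Simple modulo mapping; nanoid itself avoids the slight alphabet bias.)
function prefixedId(prefix: string, size = 12): string {
  const bytes = randomBytes(size);
  let suffix = '';
  for (let i = 0; i < size; i++) {
    suffix += ALPHABET[bytes[i] % ALPHABET.length];
  }
  return `${prefix}_${suffix}`;
}
```

A log line containing cat_xyz789 is immediately legible as a category without a schema lookup.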