milkpod

AI video transcription and Q&A workspace

Mar 1, 2026

Milkpod is an AI video transcription and Q&A workspace. Paste a YouTube link, get a full transcript with timestamps and speaker labels, then ask questions and get answers grounded in the actual content. It also generates highlight moments — ranked clips extracted using a multi-signal scoring system.

Architecture

The frontend is Next.js 16 (App Router + React 19). The backend is an Elysia server with Eden Treaty for end-to-end type-safe API calls. Auth is Better Auth with cookie-based sessions. Everything lives in a Turborepo monorepo with shared packages for the API layer, AI logic, auth, and database (Drizzle ORM + PostgreSQL with pgvector).

The AI package (@milkpod/ai) owns all model interactions and is structured with strict tree-shaking boundaries — server-only modules (streaming, retrieval, embeddings) are isolated from client-safe exports (model registry, types, schemas) so the Next.js bundle never pulls in Node.js dependencies.

Streaming chat with tool use

The chat endpoint validates messages, runs an input guardrail (LLM classifier with regex fallback for jailbreak detection), resolves the user's chosen model, and streams a response. The model has access to three tools for RAG: retrieve_segments for semantic search, read_transcript for full-transcript synthesis, and get_transcript_context for expanding around a known timestamp. Up to 5 agentic steps, 120s total timeout.

export async function createChatStream(req: ChatRequest): Promise<Response> {
  const tools = createQAToolSet({
    assetId: req.assetId,
    collectionId: req.collectionId,
  });

  const validatedMessages = await validateUIMessages<MilkpodMessage>({
    messages: req.messages,
    metadataSchema: chatMetadataSchema.optional(),
    tools,
  });

  const guardrailResult = await checkInput(validatedMessages);
  if (!guardrailResult.allowed) {
    return createRefusalResponse(validatedMessages, req.headers);
  }

  // Derived from the request (modelId / wordLimit are assumed fields on
  // ChatRequest): the user's chosen model id and the word limit remaining
  // in their daily quota.
  const parsedModelId = req.modelId;
  const effectiveWordLimit = req.wordLimit;

  const model = resolveModel(parsedModelId);

  const result = streamText({
    model,
    system: buildSystemPrompt({
      assetId: req.assetId,
      assetTitle: req.assetTitle,
      collectionId: req.collectionId,
      wordLimit: effectiveWordLimit,
    }),
    messages: await convertToModelMessages(validatedMessages),
    tools,
    maxOutputTokens: wordLimitToMaxTokens(effectiveWordLimit),
    stopWhen: [stepCountIs(5)],
    timeout: { totalMs: 120_000, chunkMs: 30_000 },
    onFinish: async ({ steps, text }) => {
      const wordCount = text.split(/\s+/).filter(Boolean).length;
      await req.onFinish?.({ responseMessage: buildResponseMessage(steps), wordCount });
    },
  });

  return result.toUIMessageStreamResponse<MilkpodMessage>({
    headers: req.headers,
    sendReasoning: true,
    originalMessages: validatedMessages,
  });
}
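The regex fallback in the guardrail step might look like the sketch below. The patterns and function name are illustrative, not Milkpod's actual list — the real first line of defense is the LLM classifier, with regexes only catching obvious jailbreak phrasings when the classifier is unavailable.

```typescript
// Illustrative jailbreak patterns — assumptions, not the production list.
const JAILBREAK_PATTERNS: RegExp[] = [
  /ignore (all |any )?(previous|prior) (instructions|prompts)/i,
  /you are now (dan|unfiltered|jailbroken)/i,
  /disregard (the|your) system prompt/i,
];

export function regexGuardrailFallback(text: string): { allowed: boolean } {
  // Flag the message if any pattern matches; otherwise let it through.
  const flagged = JAILBREAK_PATTERNS.some((p) => p.test(text));
  return { allowed: !flagged };
}
```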

Seven models are available — GPT-5.2 (default), GPT-4.1, GPT-4.1 Mini, o4-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, and Gemini 2.0 Flash. The user picks from the UI; the backend resolves the provider at runtime.
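A minimal sketch of what runtime resolution could look like — the ids, labels, and registry shape here are assumptions (the real registry maps ids to AI SDK provider instances rather than plain objects), but the fallback-to-default behavior matches the text:

```typescript
// Hypothetical registry shape; ids and labels mirror the model list above.
const MODEL_REGISTRY: Record<string, { provider: "openai" | "google"; label: string }> = {
  "gpt-5.2": { provider: "openai", label: "GPT-5.2" },
  "gpt-4.1": { provider: "openai", label: "GPT-4.1" },
  "gpt-4.1-mini": { provider: "openai", label: "GPT-4.1 Mini" },
  "o4-mini": { provider: "openai", label: "o4-mini" },
  "gemini-2.5-pro": { provider: "google", label: "Gemini 2.5 Pro" },
  "gemini-2.5-flash": { provider: "google", label: "Gemini 2.5 Flash" },
  "gemini-2.0-flash": { provider: "google", label: "Gemini 2.0 Flash" },
};

const DEFAULT_MODEL_ID = "gpt-5.2";

export function resolveModelEntry(id: string | undefined) {
  // Unknown or missing ids fall back to the default rather than erroring.
  return MODEL_REGISTRY[id ?? ""] ?? MODEL_REGISTRY[DEFAULT_MODEL_ID];
}
```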

RAG with pgvector

Transcript segments are chunked (recursive character splitting, 2800 char max, 200 char overlap) and embedded with text-embedding-3-small. Vectors are stored in a pgvector column with an HNSW index for fast cosine similarity search. The retrieve_segments tool embeds the query, finds the 8 most similar segments above a 0.3 similarity threshold, and returns them with timestamps and speaker labels.

export async function findRelevantSegments(
  query: string,
  options: RetrievalOptions = {}
): Promise<RelevantSegment[]> {
  const { assetId, collectionId, limit = 10, minSimilarity = 0.3 } = options;

  const queryEmbedding = await generateEmbedding(query);
  const similarity = sql<number>`1 - (${cosineDistance(
    embeddings.embedding, queryEmbedding
  )})`;

  // Filters accumulate here; the similarity floor always applies, and a
  // single-asset search scopes to that asset.
  const conditions = [gte(similarity, minSimilarity)];
  if (assetId) {
    conditions.push(eq(transcripts.assetId, assetId));
  }

  let queryBuilder = db()
    .select({
      segmentId: transcriptSegments.id,
      text: transcriptSegments.text,
      startTime: transcriptSegments.startTime,
      endTime: transcriptSegments.endTime,
      speaker: transcriptSegments.speaker,
      transcriptId: transcriptSegments.transcriptId,
      similarity,
    })
    .from(embeddings)
    .innerJoin(transcriptSegments, eq(embeddings.segmentId, transcriptSegments.id))
    .innerJoin(transcripts, eq(transcriptSegments.transcriptId, transcripts.id))
    .$dynamic(); // allows conditionally chaining the collection join below

  if (collectionId) {
    queryBuilder = queryBuilder.innerJoin(
      collectionItems, eq(collectionItems.assetId, transcripts.assetId)
    );
    conditions.push(eq(collectionItems.collectionId, collectionId));
  }

  return queryBuilder
    .where(and(...conditions))
    .orderBy(desc(similarity))
    .limit(limit);
}
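The chunking step described above can be sketched as a recursive character splitter. The real pipeline likely uses a library splitter; this shows the shape of the algorithm with the limits from the text (2800 chars max, 200 overlap), with the overlap applied only at the hard-split fallback in this simplified version:

```typescript
const MAX_CHARS = 2800;
const OVERLAP = 200;

export function chunkText(
  text: string,
  separators: string[] = ["\n\n", "\n", ". ", " "]
): string[] {
  if (text.length <= MAX_CHARS) return text.trim() ? [text.trim()] : [];
  if (separators.length === 0) {
    // No separator left: hard-split with overlap between windows.
    const chunks: string[] = [];
    for (let i = 0; i < text.length; i += MAX_CHARS - OVERLAP) {
      chunks.push(text.slice(i, i + MAX_CHARS));
    }
    return chunks;
  }
  const [sep, ...rest] = separators;
  // Greedily pack split pieces into chunks under the limit, recursing with
  // finer separators into any piece that is itself too large.
  const pieces = text.split(sep);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= MAX_CHARS) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      current = piece.length > MAX_CHARS ? "" : piece;
      if (piece.length > MAX_CHARS) chunks.push(...chunkText(piece, rest));
    }
  }
  if (current.trim()) chunks.push(current);
  return chunks;
}
```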

For broad tasks like "summarize this video," read_transcript returns an evenly-sampled overview (up to 60 segments via modulo sampling on segment index) instead of doing vector search. The system prompt teaches the model when to use which tool.
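The even sampling can be sketched as a modulo filter on segment index — a minimal version under the assumption that segments arrive in timestamp order:

```typescript
// Keep every k-th segment so at most maxSegments survive, preserving
// even coverage across the whole transcript.
export function sampleEvenly<T>(segments: T[], maxSegments = 60): T[] {
  if (segments.length <= maxSegments) return segments;
  const step = Math.ceil(segments.length / maxSegments);
  return segments.filter((_, i) => i % step === 0);
}
```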

Word quota with advisory locks

Each user gets 2,000 words per day (1,500 max per request). Before streaming, words are reserved atomically using PostgreSQL advisory locks — pg_advisory_xact_lock serializes concurrent requests for the same user so two parallel chats can't double-spend the budget. After streaming, unused words are released back.

static async reserveWords(userId: string, wordCount: number): Promise<number> {
  const today = todayUTC();
  const toReserve = Math.max(0, wordCount);
  if (toReserve === 0) return 0;

  return await db().transaction(async (tx) => {
    // Advisory lock serializes all usage operations for this user.
    // Prevents concurrent requests from double-spending the quota.
    await tx.execute(
      sql`SELECT pg_advisory_xact_lock(hashtext(${`usage:${userId}`}))`
    );

    const [row] = await tx
      .select({ wordsUsed: dailyUsage.wordsUsed })
      .from(dailyUsage)
      .where(and(eq(dailyUsage.userId, userId), eq(dailyUsage.usageDate, today)));

    const currentUsed = row?.wordsUsed ?? 0;
    const available = Math.max(0, DAILY_WORD_BUDGET - currentUsed);
    const toAdd = Math.min(toReserve, available);

    if (toAdd <= 0) return 0;

    if (row) {
      await tx.update(dailyUsage).set({ wordsUsed: currentUsed + toAdd })
        .where(and(eq(dailyUsage.userId, userId), eq(dailyUsage.usageDate, today)));
    } else {
      await tx.insert(dailyUsage).values({ userId, usageDate: today, wordsUsed: toAdd });
    }

    return toAdd;
  });
}
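The release step after streaming reduces to a small piece of accounting. A pure sketch (the real version presumably decrements words_used inside the same advisory-lock transaction shown above; the function name is an assumption):

```typescript
// How much of the reservation to hand back once the actual word count
// is known. Never more than was reserved, never a negative amount.
export function wordsToRelease(reserved: number, actualWordCount: number): number {
  return Math.max(0, reserved - Math.max(0, actualWordCount));
}
```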

Multi-signal moment extraction

Highlight moments use a three-signal ranking system. First, each transcript chunk is sent to a fast model to extract candidates with confidence and goal-fit scores. Overlapping candidates (>50% time overlap) are merged. Then two more signals are layered on: structural heuristics (cue phrase density, lexical density, sentence density) and QA evidence (how often segments were referenced in past user questions, weighted by retrieval relevance).
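The merge step can be sketched as an interval sweep. One interpretation of the ">50% time overlap" rule — measuring overlap against the shorter candidate — is shown below; the field names and merge policy (widen the window, keep the better score) are assumptions:

```typescript
interface Candidate {
  start: number; // seconds
  end: number;   // seconds
  score: number;
}

export function mergeOverlapping(candidates: Candidate[]): Candidate[] {
  const sorted = [...candidates].sort((a, b) => a.start - b.start);
  const merged: Candidate[] = [];
  for (const c of sorted) {
    const last = merged[merged.length - 1];
    if (last) {
      const overlap = Math.min(last.end, c.end) - Math.max(last.start, c.start);
      const shorter = Math.min(last.end - last.start, c.end - c.start);
      if (overlap > 0.5 * shorter) {
        // Merge: widen the window, keep the better score.
        last.end = Math.max(last.end, c.end);
        last.score = Math.max(last.score, c.score);
        continue;
      }
    }
    merged.push({ ...c });
  }
  return merged;
}
```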

// Final composite score — three signals, weighted
const finalScore =
  0.45 * llmScore +
  0.35 * qaSignal +
  0.20 * structuralScore;

// Preset-specific boosts (e.g. "hook" boosts early video, "actionable" boosts imperatives)
// Then sort by finalScore, take top 10

Six presets shape the extraction: default (balanced), hook (attention grabbers, boosts early video), insight (aha moments), quote (shareable lines), actionable (concrete steps), and story (emotional peaks). Each preset gets a different system prompt and post-ranking boost.
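A post-ranking boost might look like the sketch below. The multipliers, the 15% "early video" cutoff, and the imperative-verb list are all illustrative assumptions; the text specifies only that "hook" boosts early video and "actionable" boosts imperatives:

```typescript
export function applyPresetBoost(
  finalScore: number,
  preset: string,
  moment: { start: number; text: string },
  videoDuration: number
): number {
  // "hook": favor moments in the opening stretch of the video.
  if (preset === "hook" && moment.start < 0.15 * videoDuration) {
    return finalScore * 1.2;
  }
  // "actionable": favor lines that open with an imperative verb.
  if (preset === "actionable" && /^(do|try|start|use|avoid|stop)\b/i.test(moment.text)) {
    return finalScore * 1.15;
  }
  return finalScore;
}
```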

Ingest pipeline

Video ingestion is fire-and-forget. The endpoint returns immediately after creating the asset record; a background pipeline handles transcription (YouTube captions via Innertube API with Android client spoofing), embedding generation (batched 64 at a time), and status updates via WebSocket events. The UI shows real-time progress as the asset moves through queued → transcribing → embedding → ready.
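That status progression can be expressed as a transition table — a sketch of how the pipeline might reject an invalid update before broadcasting a WebSocket event (the status names come from the pipeline above; "failed" is an assumed extra terminal state):

```typescript
type AssetStatus = "queued" | "transcribing" | "embedding" | "ready" | "failed";

// Each status may only advance to the listed successors.
const NEXT: Record<AssetStatus, AssetStatus[]> = {
  queued: ["transcribing", "failed"],
  transcribing: ["embedding", "failed"],
  embedding: ["ready", "failed"],
  ready: [],
  failed: [],
};

export function canTransition(from: AssetStatus, to: AssetStatus): boolean {
  return NEXT[from].includes(to);
}
```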

The hard parts

Cross-origin auth on Railway. The frontend and backend deploy to different Railway services. Cookies with sameSite: 'none' are increasingly blocked by browsers. The fix was putting both services under the same root domain so cookies become first-party with sameSite: 'lax'.

Token bucket rate limiting is per-user (or per-IP for unauthenticated routes) with four tiers: ingest (10/min), chat (30/min), CRUD (100/min), and auth (10/min). Buckets auto-cleanup after 10 minutes of inactivity.
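A minimal token bucket matching those tiers could be sketched like this (the continuous-refill design and class shape are assumptions; the cleanup-after-inactivity logic is elided):

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillPerMs: number, // tokens regained per millisecond
    now = Date.now()
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  tryConsume(now = Date.now()): boolean {
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + (now - this.lastRefill) * this.refillPerMs
    );
    this.lastRefill = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// e.g. the chat tier: 30 requests per minute
const chatBucket = new TokenBucket(30, 30 / 60_000);
```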

Evidence tracking closes the loop between chat and moments. Every segment the retrieval tools return is logged in a qa_evidence table with its relevance score. When moments are generated later, segments that users actually asked about get boosted — the system learns which parts of a video matter from real usage.
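The QA-evidence signal could be aggregated per segment as below. The formula — sum each segment's logged relevance scores, then normalize against the best segment — is one plausible reading; the text specifies only "how often segments were referenced, weighted by retrieval relevance":

```typescript
interface EvidenceRow {
  segmentId: string;
  relevance: number; // similarity score logged at retrieval time
}

export function qaSignalBySegment(rows: EvidenceRow[]): Map<string, number> {
  // Frequency weighted by relevance: repeated references accumulate.
  const sums = new Map<string, number>();
  for (const { segmentId, relevance } of rows) {
    sums.set(segmentId, (sums.get(segmentId) ?? 0) + relevance);
  }
  // Normalize to [0, 1] against the highest-evidence segment.
  const max = Math.max(0, ...sums.values());
  if (max > 0) {
    for (const [id, s] of sums) sums.set(id, s / max);
  }
  return sums;
}
```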