Daily Reading · Enyalië

reading - This post is part of a series.

Databricks

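When deploying today's high-performance small models, GPU compute units often sit idle because the batch size is too small or the computation is too simple. One remedy is NVIDIA MPS (Multi-Process Service). It lets multiple processes share a single GPU more efficiently by multiplexing their CUDA kernels onto the hardware. In effect, it maps multiple host-side CUDA contexts onto a single device-side context, which sidesteps the time-slicing that the GPU hardware scheduler would otherwise apply between contexts.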
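As a rough illustration of how several inference workers might be started under MPS, here is a minimal sketch. The daemon command `nvidia-cuda-mps-control -d` and the `CUDA_MPS_*` environment variables come from NVIDIA's MPS documentation; the helper names, the directory paths, and the worker script `serve_model.py` are hypothetical and do not come from the Databricks post.

```python
import os
import subprocess

# Command that starts the MPS control daemon in the background.
MPS_START_CMD = ["nvidia-cuda-mps-control", "-d"]


def mps_client_env(pipe_dir, log_dir, thread_pct=None, base_env=None):
    """Build the environment for a process that should run as an MPS client."""
    env = dict(os.environ if base_env is None else base_env)
    # Daemon and clients must agree on the pipe directory to find each other.
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    env["CUDA_MPS_LOG_DIRECTORY"] = log_dir
    if thread_pct is not None:
        if not 0 < thread_pct <= 100:
            raise ValueError("thread_pct must be in (0, 100]")
        # Optional cap on the share of GPU threads a client may use.
        env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(thread_pct)
    return env


def launch_workers(worker_cmd, n_workers, env):
    """Start n_workers copies of worker_cmd with the given environment."""
    return [subprocess.Popen(worker_cmd, env=env) for _ in range(n_workers)]


# Hypothetical usage on a machine with an NVIDIA GPU (not executed here):
# env = mps_client_env("/tmp/mps", "/tmp/mps-log", thread_pct=25)
# subprocess.run(MPS_START_CMD, check=True, env=env)
# workers = launch_workers(["python", "serve_model.py"], 4, env)
```

NVIDIA's documentation shuts the daemon down with `echo quit | nvidia-cuda-mps-control`.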

Netflix TechBlog

Original article: netflixtechblog.com

The AI Evolution of Graph Search at Netflix: From Structured Queries to Natural Language

This post describes how graph search at Netflix has evolved with AI.

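Netflix's data is structured as a graph, much like a knowledge graph: the nodes are movies, series, directors, actors, and so on, and the edges are relationships such as "acted in", "directed", "belongs to", and "adapted from".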
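To make the node-and-edge picture concrete, here is a minimal sketch of such a graph stored as triples. The entity names and relation labels are made-up placeholders, not Netflix's actual schema.

```python
from collections import defaultdict

# (source, relation, target) triples; all names are made-up placeholders.
EDGES = [
    ("Actor A", "ACTED_IN", "Series X"),
    ("Director B", "DIRECTED", "Series X"),
    ("Series X", "BELONGS_TO", "Genre G"),
    ("Series X", "ADAPTED_FROM", "Novel N"),
    ("Actor A", "ACTED_IN", "Movie Y"),
]


def build_index(edges):
    """Index outgoing edges by (source, relation) for quick lookups."""
    index = defaultdict(list)
    for src, rel, dst in edges:
        index[(src, rel)].append(dst)
    return index


def neighbors(index, node, relation):
    """Return the nodes reachable from `node` over one `relation` edge."""
    return list(index.get((node, relation), []))


index = build_index(EDGES)
print(neighbors(index, "Actor A", "ACTED_IN"))  # ['Series X', 'Movie Y']
```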

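The traditional approach is to write structured queries against this graph (the original post illustrates it with a figure).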

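The goal now is to support natural-language queries. The core of this process is converting the user's natural-language question into a structured query, which must be syntactically correct, semantically correct, and pragmatically correct.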

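Preparing for the filter-generation task means building context: the LLM needs access to the index's fields and their metadata to construct semantically correct filters. Rather than supplying every field in the index, however, they use RAG to select and organize the relevant context.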

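The field RAG process is:

Create embeddings for the index fields and their metadata (name, description, type) and index them in a vector store.
When generating a filter, chunk the user's question using an overlapping strategy. For each chunk, run a vector search to identify the top K most relevant values and the fields they belong to.
De-duplicate: the top K fields from each chunk are merged and de-duplicated before being supplied as context to the system instructions.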
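The three steps above can be sketched as follows. This is a toy under stated assumptions, not Netflix's implementation: the field names and descriptions are hypothetical, and a bag-of-words vector with cosine similarity stands in for a real embedding model and vector store.

```python
import math
import re
from collections import Counter

# Hypothetical index fields: name -> description.
FIELDS = {
    "origin.country": "country where the title was produced",
    "genre.tags": "genre or theme tags of the title",
    "launch.year": "year the title launched",
    "cast.actor": "actor who appears in the title",
}


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def chunk(question, size=4, overlap=2):
    """Split the question into overlapping word windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = question.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


def retrieve_fields(question, field_vectors, k=2):
    """Top-k fields per chunk, merged and de-duplicated in first-seen order."""
    selected = []
    for piece in chunk(question):
        query = embed(piece)
        scored = sorted(
            ((cosine(query, vec), name) for name, vec in field_vectors.items()),
            key=lambda pair: pair[0],
            reverse=True,
        )
        for score, name in scored[:k]:
            if score > 0 and name not in selected:
                selected.append(name)
    return selected


# Step 1: embed each field together with its metadata.
field_vectors = {name: embed(f"{name} {desc}") for name, desc in FIELDS.items()}
# Steps 2 and 3: chunk the question, search per chunk, de-duplicate.
print(retrieve_fields("titles produced in the country Germany", field_vectors))
```

Chunking with overlap means a phrase that straddles a chunk boundary still appears whole in at least one window, which is why the merged result needs de-duplication afterwards.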

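A controlled vocabulary is a specific kind of field whose values come from a finite, fixed set of terms, so it needs dedicated handling. The controlled-vocabulary RAG process is: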

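Create embeddings for the controlled-vocabulary values and their metadata and index them in a vector store. The controlled vocabularies are available through GraphQL and are periodically fetched and re-indexed so that the system keeps up with any changes in the domain.
When generating a filter, the user's question is chunked. For each chunk, a vector search identifies the top K most relevant values (restricted to controlled vocabularies associated with fields in the index).
The top K values from each chunk are de-duplicated by controlled-vocabulary type. The matched values and their associated field definitions are then injected into the context.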
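A minimal sketch of the matching, de-duplication, and context-injection steps above. The vocabularies, field definitions, and threshold are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is used as a stand-in for vector similarity, so this illustrates the data flow rather than the real retrieval quality.

```python
from difflib import SequenceMatcher

# Hypothetical controlled vocabularies: type -> allowed values.
VOCABULARIES = {
    "Country": ["Germany", "France", "Japan"],
    "Genre": ["Time Travel", "Comedy", "Documentary"],
    "Studio": ["Studio One"],  # not tied to any indexed field below
}

# Hypothetical indexed fields and the vocabulary each one draws from.
FIELD_DEFS = {
    "origin.country": {"vocabulary": "Country", "description": "country of origin"},
    "genre.tags": {"vocabulary": "Genre", "description": "genre or theme tags"},
}


def similarity(a, b):
    """Stand-in for vector similarity: character-level match ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_values(chunks, k=1, threshold=0.6):
    """Top-k vocabulary values per chunk, de-duplicated per vocabulary type."""
    # Only search vocabularies that back a field in the index.
    indexed_types = {d["vocabulary"] for d in FIELD_DEFS.values()}
    candidates = [
        (vtype, value)
        for vtype, values in VOCABULARIES.items()
        if vtype in indexed_types
        for value in values
    ]
    matched = {}
    for piece in chunks:
        scored = sorted(
            ((similarity(piece, value), vtype, value) for vtype, value in candidates),
            key=lambda item: item[0],
            reverse=True,
        )
        for score, vtype, value in scored[:k]:
            if score < threshold:
                continue
            values = matched.setdefault(vtype, [])
            if value not in values:
                values.append(value)
    return matched


def build_context(matched):
    """Attach each matched value list to the definition of its field."""
    return [
        {"field": name, "description": d["description"],
         "values": matched[d["vocabulary"]]}
        for name, d in FIELD_DEFS.items()
        if d["vocabulary"] in matched
    ]


matched = match_values(["german", "time travel", "germany"])
print(build_context(matched))
```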

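Putting the two RAG tools together, the overall flow is: retrieve the relevant fields, retrieve the matching controlled-vocabulary values, and inject both into the context.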

After that, the remaining steps are to give the LLM its instructions and to validate its output.
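A minimal sketch of what output validation could look like, assuming a toy filter grammar of `field OP 'value'` clauses joined by `AND`, modelled on the example filter string quoted from the article. The schema, the operator set, and the function names are hypothetical. Parsing covers syntactic correctness and the schema check covers semantic correctness; pragmatic correctness (whether the filter matches what the user meant) is not something a check like this can establish.

```python
import re

# Hypothetical index schema: field name -> operators allowed on it.
SCHEMA = {
    "origin.country": {"=="},
    "genre.tags": {"CONTAINS"},
    "synopsisKeywords": {"LIKE"},
}

# One clause: field, operator, single-quoted value.
CONDITION = re.compile(
    r"^\s*([A-Za-z_][\w.]*)\s+(==|CONTAINS|LIKE)\s+'([^']*)'\s*$"
)


def parse_filter(text):
    """Syntactic check: parse each clause or raise ValueError.

    Toy limitation: a quoted value that itself contains ' AND ' is not handled.
    """
    clauses = []
    for part in re.split(r"\s+AND\s+", text.strip()):
        match = CONDITION.match(part)
        if match is None:
            raise ValueError(f"syntax error in clause: {part!r}")
        clauses.append(match.groups())
    return clauses


def semantic_errors(clauses, schema=SCHEMA):
    """Semantic check: every field must exist and accept the operator used."""
    errors = []
    for field, op, _value in clauses:
        if field not in schema:
            errors.append(f"unknown field: {field}")
        elif op not in schema[field]:
            errors.append(f"operator {op} not allowed on {field}")
    return errors


clauses = parse_filter(
    "origin.country == 'Germany' AND genre.tags CONTAINS 'Time Travel'"
)
print(clauses, semantic_errors(clauses))
```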

Another point worth noting: the article describes two approaches to handling ambiguity, and both rely on interaction with the user:

One way we are handling ambiguity is by showing our work. We visualise the generated filters in the UI in a user-friendly way allowing them to very clearly see if the answer we’re returning is what they were looking for so they can trust the results.
We cannot show a raw DSL string (e.g., origin.country == ‘Germany’ AND genre.tags CONTAINS ‘Time Travel’ AND synopsisKeywords LIKE ‘cave’) to a non-technical user. Instead, we reflect its underlying AST into UI components.
After the LLM generates a filter statement, we parse it into an AST, and then map that AST to the existing “Chips” and “Facets” in our UI (see below). If the LLM generates a filter for origin.country == ‘Germany’, the user sees the “Country” dropdown pre-selected to “Germany.” This gives users immediate visual feedback and the ability to easily fine-tune the query using standard UI controls when the results need improvement or further experimentation.
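The AST-to-UI mapping described in the quote can be sketched as a small tree walk. The node classes, the control table, and the output shape below are hypothetical; the article does not show its actual AST or component API.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Condition:
    """Leaf node: a single `field OP value` comparison."""
    field: str
    op: str
    value: str


@dataclass
class And:
    """Inner node: all children must hold."""
    children: List["Node"]


Node = Union[Condition, And]

# Hypothetical mapping from index field to the UI control that represents it.
UI_CONTROLS = {
    "origin.country": ("dropdown", "Country"),
    "genre.tags": ("chip", "Genre"),
    "synopsisKeywords": ("chip", "Keyword"),
}


def to_ui_state(node):
    """Flatten the AST into a list of pre-selected UI controls."""
    if isinstance(node, And):
        state = []
        for child in node.children:
            state.extend(to_ui_state(child))
        return state
    kind, label = UI_CONTROLS[node.field]
    return [{"control": kind, "label": label, "selected": node.value}]


ast = And([
    Condition("origin.country", "==", "Germany"),
    Condition("genre.tags", "CONTAINS", "Time Travel"),
])
for control in to_ui_state(ast):
    print(control)
```

Working from the AST rather than the raw string means the UI never has to re-parse the DSL, and edits made through the controls can be turned back into the same node types.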
Another strategy we’ve developed to remove ambiguity happens at query time. We give users the ability to constrain their input to refer to known entities using “@mentions”. Similar to Slack, typing @ lets them search for entities directly from our specialized UI Graph Search component, giving them easy access to multiple controlled vocabularies (plus other identifying metadata like launch year) to feel confident they’re choosing the entity they intend.
If a user types, “When was @dark produced”, we explicitly know they are referring to the Series controlled vocabulary, allowing us to bypass the RAG inference step and hard-code that context, significantly increasing pragmatic correctness (and building user trust in the process).
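A minimal sketch of the "@mentions" idea: mentions are pulled out of the question and resolved against known entities, so their context can be fixed directly instead of inferred. The catalogue contents, the regular expression, and the return shape are hypothetical.

```python
import re

# Hypothetical entity catalogue: mention key -> (vocabulary type, canonical name).
KNOWN_ENTITIES = {
    "dark": ("Series", "Dark"),
}

MENTION = re.compile(r"@(\w+)")


def resolve_mentions(question, entities=KNOWN_ENTITIES):
    """Return (resolved, unresolved) mentions found in the question."""
    resolved, unresolved = [], []
    for key in MENTION.findall(question):
        entity = entities.get(key.lower())
        if entity is None:
            # Unknown mention: fall back to the normal RAG path for this term.
            unresolved.append(key)
        else:
            resolved.append({"mention": key, "type": entity[0], "name": entity[1]})
    return resolved, unresolved


resolved, unresolved = resolve_mentions("When was @dark produced")
print(resolved, unresolved)
```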