Recipe: Knowledge Graph from Documents

This recipe turns a pile of documents into a queryable knowledge graph. You ingest source material into a document store, extract the people and companies named in it into entity and organizations records, then wire those records together as nodes and edges in a graph you can traverse.

The problem it solves

Documents are good at holding prose and terrible at answering “who is connected to whom”. Once you have hundreds of reports, contracts, or notes, the relationships between the people and organizations named in them are invisible. This recipe lifts those relationships out of the text and into a graph, so you can walk from a person to their company to that company’s other contacts in one traversal.

Elements

Element                  Role
document                 Stores and searches the source material as semi-structured pages.
entity                   A record of a person, AI, or other actor named in the documents.
organizations            A record of a company, agency, or school your entities belong to.
graph                    Nodes, edges, and traversals over the extracted relationships (backed by Apache AGE).
python (or any action)   Extraction step that reads pages and upserts records and edges.

Flow

  1. Create a document store. Load source material with write (create-or-update a page) or import, then make it searchable with vectorize and query it with search.
  2. Create an entity store and an organizations store. As you extract names, upsert a record for each person and each company. Both elements carry rich identity — add_email, add_phone, add_address, add_social — and organizations expose find_by_domain and find_by_org_no so you can de-duplicate on a stable key instead of on a fuzzy name.
  3. Link people to companies with the organizations element’s list_contacts model (the contacts who work at an org).
  4. Create a graph. For each entity and organization, add-node with a label (e.g. Person, Company) and properties; for each relationship you extracted (works-at, mentioned-with, reports-to), add-edge between the two nodes.
  5. Do the extraction in an action element — a python step that pulls pages from the document search, parses out the named entities, and calls upsert plus add-node/add-edge. Attach the data elements to the action so it has their connections.
  6. Ask questions with the graph’s traverse (breadth- or depth-first from a starting node, bounded by depth), shortest-path between two nodes, or query for a labelled lookup.
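Steps 4 and 6 can be sketched in plain Python to show what the graph operations amount to. This is a minimal in-memory stand-in, not the graph element's real API: the class SketchGraph, its method signatures, and the sample triples are all hypothetical, and the real element is backed by Apache AGE rather than by dictionaries. The sketch only illustrates add-node, add-edge, a depth-bounded breadth-first traverse, and a fewest-hops shortest path.

```python
from collections import deque


class SketchGraph:
    """In-memory stand-in for the graph element (hypothetical, not the real API)."""

    def __init__(self):
        self.nodes = {}  # node id -> {"label": str, "props": dict}
        self.edges = {}  # node id -> list of (relationship, neighbour id)

    def add_node(self, node_id, label, **props):
        # Upsert semantics: re-adding a node overwrites its label and properties
        # but keeps any edges it already has.
        self.nodes[node_id] = {"label": label, "props": props}
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, rel, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        # Stored in both directions so a walk can go from a company back to people.
        self.edges[src].append((rel, dst))
        self.edges[dst].append((rel, src))

    def traverse(self, start, depth):
        """Breadth-first walk from start, bounded by depth; ids in visit order."""
        seen = {start}
        order = [start]
        queue = deque([(start, 0)])
        while queue:
            node, dist = queue.popleft()
            if dist == depth:
                continue  # do not expand nodes that sit at the depth limit
            for _rel, nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append((nxt, dist + 1))
        return order

    def shortest_path(self, src, dst):
        """Fewest-hops path from src to dst, or None if dst is unreachable."""
        prev = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            for _rel, nxt in self.edges[node]:
                if nxt not in prev:
                    prev[nxt] = node
                    queue.append(nxt)
        return None


# Hypothetical extraction output: (person, relationship, company) triples.
triples = [
    ("alice", "works-at", "acme"),
    ("bob", "works-at", "acme"),
    ("carol", "works-at", "globex"),
]

graph = SketchGraph()
for person, rel, company in triples:
    graph.add_node(person, "Person")
    graph.add_node(company, "Company")
    graph.add_edge(person, rel, company)

colleagues = graph.traverse("alice", depth=2)  # alice, her company, its other contacts
path = graph.shortest_path("alice", "bob")
```

With a depth of 2, the walk from a person reaches their company at one hop and that company's other contacts at two, which is the "person to company to other contacts" question from the problem statement. People at an unconnected company stay out of the result.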

What this shows

The same facts live in three shapes, each good at a different question: document answers “what does the source say”, entity / organizations answer “what is the canonical record for this person or company”, and graph answers “how are they connected”. Because the entity and organization records carry verified contact identity, the graph you build on top of them is anchored to real, de-duplicated nodes rather than to raw strings scraped from prose.
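To make the de-duplication point concrete, here is a small sketch of keying organization records on a domain instead of on a name. It assumes a hypothetical in-memory store (SketchOrgStore) and invented sample mentions; only the names upsert and find_by_domain come from the recipe, and their signatures here are guesses made for illustration.

```python
class SketchOrgStore:
    """In-memory stand-in for an organizations store (hypothetical, not the real API)."""

    def __init__(self):
        self._by_domain = {}  # normalized domain -> record

    def find_by_domain(self, domain):
        # Domains are compared case-insensitively.
        return self._by_domain.get(domain.lower())

    def upsert(self, name, domain):
        key = domain.lower()
        record = self._by_domain.get(key)
        if record is None:
            # First sighting of this domain: create the canonical record.
            record = {"name": name, "domain": key, "aliases": set()}
            self._by_domain[key] = record
        elif name != record["name"]:
            # Same domain, different spelling: keep it as an alias, not a new org.
            record["aliases"].add(name)
        return record


# Hypothetical names and domains as they might be extracted from prose.
mentions = [
    ("Acme Corp", "acme.example"),
    ("ACME Corporation", "ACME.example"),
    ("Acme", "acme.example"),
    ("Globex", "globex.example"),
]

store = SketchOrgStore()
for name, domain in mentions:
    store.upsert(name, domain)

org_count = len(store._by_domain)
acme = store.find_by_domain("acme.example")
```

Three spellings of the same company collapse into one record because the key is the domain, so a graph built from these records gets one Company node per organization instead of one per spelling.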
