Formal Framework
Disclaimer: The definitions below are proposed and speculative. They represent the current best formulation of PKT’s core ideas, not proven results. Where something is a conjecture rather than a theorem, it is labeled as such.
1. The Knowledge Tensor
Definition. The Knowledge Tensor at time $t$ is a multidimensional array:
\[\mathcal{T}^{(t)} \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_n}\]where each mode (dimension) represents a distinct aspect of knowledge:
| Mode | Interpretation | Example |
|---|---|---|
| $d_1$ | Concepts / entities | “electron,” “gravity,” “Paris” |
| $d_2$ | Relations / predicates | “is-a,” “causes,” “located-in” |
| $d_3$ | Confidence / belief strength | $[0, 1]$ — how strongly the relation is held |
| $d_4$ | Temporal context | When the belief was formed or applies |
| $d_5, \ldots, d_n$ | Additional structure | Modality, source, abstraction level |
Each entry $\mathcal{T}^{(t)}_{i_1, i_2, \ldots, i_n}$ represents the belief strength for a specific knowledge claim — e.g., “the concept electron stands in the relation has-charge with confidence 0.97 in the context of quantum mechanics.”
This is deliberately more expressive than a knowledge graph (whose adjacency structure is an order-2 matrix) or a knowledge base (which stores discrete facts). The tensor captures graded, structured, multi-relational knowledge.
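As a concrete toy illustration, a small Knowledge Tensor can be held as a NumPy array indexed by concept, relation, and context. The index maps and the 0.97 value below are illustrative assumptions, not part of the framework:

```python
import numpy as np

# Toy Knowledge Tensor with three modes: concepts x relations x contexts.
# All index assignments and confidence values here are assumed for the example.
concepts = {"electron": 0, "proton": 1}
relations = {"has-charge": 0, "is-a": 1}
contexts = {"quantum-mechanics": 0, "classical": 1}

T = np.zeros((len(concepts), len(relations), len(contexts)))

# "electron has-charge" held with confidence 0.97 in the quantum-mechanics context.
T[concepts["electron"], relations["has-charge"], contexts["quantum-mechanics"]] = 0.97

belief = T[concepts["electron"], relations["has-charge"], contexts["quantum-mechanics"]]
print(belief)  # 0.97
```

Each entry is a graded belief rather than a discrete fact, which is exactly what distinguishes the tensor from a symbolic knowledge base.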
2. Inductive Learning
The inductive process builds $\mathcal{T}$ from data. Given a dataset $\mathcal{D}$, the model learns parameters $\theta$ that populate the tensor:
\[\mathcal{T}^{(t)} = f_\theta(\mathcal{D})\]where $f_\theta$ is a neural encoder (e.g., a transformer or graph neural network). The inductive loss encourages the tensor to faithfully represent the data:
\[\mathcal{L}_{\text{inductive}}(\theta) = -\mathbb{E}_{x \sim \mathcal{D}} \left[ \log p_\theta(x) \right]\]This is standard — any self-supervised or supervised objective fits here. The key point is that inductive learning alone produces a tensor that reflects statistical patterns in $\mathcal{D}$, including any biases, correlations, or outright falsehoods present in the data.
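For concreteness, the inductive loss for a toy categorical model can be computed directly. Here `probs` stands in for $p_\theta$ and the three-item dataset is an assumption:

```python
import numpy as np

def inductive_loss(probs, data):
    """Mean negative log-likelihood of observed items under the model."""
    return -np.mean([np.log(probs[x]) for x in data])

probs = {"a": 0.7, "b": 0.3}  # hypothetical model distribution p_theta
data = ["a", "a", "b"]        # hypothetical dataset D
loss = inductive_loss(probs, data)
```

Any likelihood-based objective has this shape; the point of the section is only that minimizing it reproduces whatever patterns, true or false, the data contains.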
3. The Falsification Operator
Definition. The Falsification Operator is a function:
\[\mathcal{F}: \mathbb{R}^{d_1 \times \cdots \times d_n} \times \mathcal{R} \rightarrow \mathbb{R}^{d_1 \times \cdots \times d_n}\]where $\mathcal{R}$ is a set of deductive rules. The operator takes a Knowledge Tensor and a rule set, and returns a pruned tensor:
\[\mathcal{T}' = \mathcal{F}(\mathcal{T}, \mathcal{R})\]Proposed semantics. For each rule $r \in \mathcal{R}$ and each tensor entry, $\mathcal{F}$ checks whether the entry is consistent with $r$. If not, the entry is projected to zero:
\[\mathcal{F}(\mathcal{T}, \mathcal{R})_{i_1, \ldots, i_n} = \begin{cases} \mathcal{T}_{i_1, \ldots, i_n} & \text{if } \forall r \in \mathcal{R}: \text{consistent}(\mathcal{T}_{i_1, \ldots, i_n}, r) \\ 0 & \text{otherwise} \end{cases}\]This is a hard operation — not a soft penalty. Entries that violate deductive rules are eliminated, not merely discouraged.
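A minimal sketch of the hard operator, assuming rules are encoded as boolean predicates over an index and its value (PKT leaves the rule representation open):

```python
import numpy as np

def falsify(T, rules):
    """Zero out every entry that violates any rule; keep the rest unchanged."""
    out = T.copy()
    for idx, value in np.ndenumerate(T):
        if not all(rule(idx, value) for rule in rules):
            out[idx] = 0.0  # hard projection to zero, not a soft penalty
    return out

# Example rule (an assumption): confidences must lie in [0, 1].
in_unit_interval = lambda idx, v: 0.0 <= v <= 1.0

T = np.array([[0.9, 1.3], [-0.2, 0.5]])
T_pruned = falsify(T, [in_unit_interval])
# T_pruned == [[0.9, 0.0], [0.0, 0.5]]
```

The defining property is the discontinuity: a violating entry is set exactly to zero regardless of how strongly the data supported it.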
What counts as a “rule”?
Rules in $\mathcal{R}$ could include:
- Logical constraints: $\forall x: \text{mammal}(x) \Rightarrow \text{vertebrate}(x)$
- Physical laws: Conservation of energy, transitivity of ordering relations
- Ontological constraints: An entity cannot belong to mutually exclusive categories
- Consistency requirements: $P(A) + P(\neg A) = 1$
The rule set is externally provided, not learned. This is a deliberate design choice: the rules represent the deductive component of knowledge, which in the Popperian framework comes from theory, not from data.
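One plausible encoding of the first kind of rule, checking "mammal(x) ⇒ vertebrate(x)" against an assumed is-a slice of the tensor. All index positions and the 0.5 belief threshold are illustrative assumptions:

```python
import numpy as np

IS_A = 0                      # hypothetical relation index
MAMMAL, VERTEBRATE = 0, 1     # hypothetical category indices

def implies_rule(T, threshold=0.5):
    """True iff every entity believed (above threshold) to be a mammal
    is also believed to be a vertebrate."""
    mammals = T[:, IS_A, MAMMAL] > threshold
    vertebrates = T[:, IS_A, VERTEBRATE] > threshold
    return bool(np.all(~mammals | vertebrates))

T = np.zeros((3, 1, 2))       # entities x relations x categories
T[0, IS_A, MAMMAL] = 0.9
T[0, IS_A, VERTEBRATE] = 0.8  # consistent: a mammal that is a vertebrate
ok = implies_rule(T)

T[1, IS_A, MAMMAL] = 0.9      # mammal with no vertebrate belief: violation
bad = implies_rule(T)
```

Note this checks a whole slice at once rather than a single entry; how to factor such global constraints into the per-entry `consistent` predicate of the definition above is itself an open design question.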
Open problem: differentiability
The falsification operator as defined above is not differentiable — it involves a hard threshold (zero out or keep). This creates a tension with gradient-based training. Three possible resolutions:
- Straight-through estimator: Use the hard operator in the forward pass but approximate gradients in the backward pass (Bengio et al., 2013).
- Alternating optimization: Alternate between gradient-based inductive steps and non-differentiable falsification steps (similar to EM algorithms).
- Smooth approximation: Replace the hard threshold with a steep sigmoid, making $\mathcal{F}$ approximately differentiable. But this compromises the “hard falsification” claim — it becomes a soft constraint with a very high penalty, which is what existing systems already do.
This is the central technical challenge of PKT. Resolution 2 (alternating optimization) is the most promising, as it preserves the hard falsification property while remaining compatible with neural training.
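Resolution 2 can be sketched in a few lines: a gradient step on a differentiable objective, followed by the hard falsification step. The quadratic objective (pulling entries toward 0.8) and the toy rule are assumptions chosen so the loop visibly converges:

```python
import numpy as np

def alternating_step(T, grad, rules, lr=0.1):
    T = T - lr * grad(T)                      # inductive (gradient) step
    for idx, v in np.ndenumerate(T):
        if not all(rule(idx, v) for rule in rules):
            T[idx] = 0.0                      # hard falsification step
    return T

grad = lambda T: 2.0 * (T - 0.8)              # pulls entries toward 0.8
rules = [lambda idx, v: v <= 1.0]             # rule: beliefs cannot exceed 1

T = np.array([0.0, 1.5])
for _ in range(50):
    T = alternating_step(T, grad, rules)
# both entries settle near 0.8: the data-driven value that survives the rule
```

The entry starting at 1.5 is zeroed by the first falsification step and then re-learned from scratch, which is the intended dynamic: falsification is not a barrier to recovery, only to rule-violating states.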
4. Tensor Decomposition as Pruning
After falsification, the tensor $\mathcal{T}'$ may be sparse and noisy. Tensor decomposition provides a principled way to extract the essential structure.
The CP decomposition (Hitchcock, 1927; Kolda & Bader, 2009) approximates a tensor as a sum of rank-one components:
\[\mathcal{T}' \approx \sum_{k=1}^{R} \lambda_k \, \mathbf{a}_k^{(1)} \otimes \mathbf{a}_k^{(2)} \otimes \cdots \otimes \mathbf{a}_k^{(n)}\]where $R$ is the rank, $\lambda_k$ are weights, and $\mathbf{a}_k^{(j)}$ are factor vectors. Choosing a low rank $R$ forces the decomposition to retain only the dominant patterns, discarding noise and hallucination.
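The CP form can be written out directly with `einsum`: a 3-way tensor assembled as a weighted sum of $R$ rank-one outer products. This shows the structure of the formula, not a fitting algorithm (fitting would use something like alternating least squares); the factors below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
R, d1, d2, d3 = 2, 4, 3, 5
lam = rng.random(R)           # weights lambda_k
A1 = rng.random((R, d1))      # factor vectors a_k^(1)
A2 = rng.random((R, d2))      # factor vectors a_k^(2)
A3 = rng.random((R, d3))      # factor vectors a_k^(3)

# T'[i,j,k] = sum_r lam_r * A1[r,i] * A2[r,j] * A3[r,k]
T_prime = np.einsum("r,ri,rj,rk->ijk", lam, A1, A2, A3)
```

With $R$ much smaller than the tensor dimensions, only $R(d_1 + d_2 + d_3)$ numbers describe the whole array, which is what forces the pruning of incidental detail.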
Conjecture. Low-rank decomposition of the falsified tensor $\mathcal{T}'$ produces a representation that is both logically consistent (by construction, since inconsistent entries were zeroed) and statistically parsimonious (by the rank constraint). This combination may be a useful formal definition of “knowledge” — structured belief that has survived both empirical and logical tests.
5. The Combined Loss Function
The full PKT training objective combines inductive and falsification losses:
\[\mathcal{L} = \mathcal{L}_{\text{inductive}} + \lambda \, \mathcal{L}_{\text{falsification}}\]where:
- $\mathcal{L}_{\text{inductive}}$ is the standard data likelihood (as above).
- $\mathcal{L}_{\text{falsification}}$ penalizes entries that would be zeroed by $\mathcal{F}$ (for example, the summed squared magnitude of rule-violating entries).
- $\lambda > 0$ controls the trade-off between fitting the data and obeying the rules.
Note: This loss function is a soft approximation of the hard falsification operator. In practice, training would likely use this differentiable loss, with periodic applications of the hard operator $\mathcal{F}$ to enforce strict consistency. The interplay between soft guidance (during gradient steps) and hard enforcement (between epochs) is a key area for future work.
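A minimal numeric sketch of the combined objective, using a squared-magnitude penalty as one plausible instantiation of $\mathcal{L}_{\text{falsification}}$; the violation mask and all numbers are assumptions:

```python
import numpy as np

def combined_loss(nll, T, violation_mask, lam=10.0):
    """Inductive NLL plus a soft penalty on rule-violating mass.

    violation_mask marks entries that the hard operator F would zero;
    in a real system it would come from evaluating the rule set R.
    """
    falsification = np.sum((T * violation_mask) ** 2)  # L_falsification
    return nll + lam * falsification

T = np.array([0.9, 0.4, 0.7])
violation_mask = np.array([0.0, 0.0, 1.0])  # third entry breaks a rule
loss = combined_loss(nll=1.25, T=T, violation_mask=violation_mask)
```

Because the penalty is smooth in `T`, gradients flow through it; the hard operator would then be applied between epochs, as the note above describes.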
6. The Learning Cycle
Putting it all together, PKT proposes an iterative learning process:
Step 1. Induction. Learn $\mathcal{T}^{(t)}$ from data by minimizing $\mathcal{L}$.
Step 2. Falsification. Apply $\mathcal{F}$ to get $\mathcal{T}'^{(t)} = \mathcal{F}(\mathcal{T}^{(t)}, \mathcal{R})$.
Step 3. Decomposition. Compute a low-rank approximation of $\mathcal{T}'^{(t)}$ to extract core structure.
Step 4. Refinement. Use the decomposed tensor as initialization for the next inductive step.
Repeat until convergence.
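The four steps can be exercised end to end on a toy matrix-shaped tensor. Here `induce`, `falsify_step`, and `decompose` are minimal stand-ins for the components of Sections 2–4 (for a matrix, truncated SVD plays the role of the CP decomposition), not PKT's actual algorithms:

```python
import numpy as np

def induce(T, data):
    return 0.5 * T + 0.5 * data              # Step 1: move toward the data

def falsify_step(T):
    return np.where(T <= 1.0, T, 0.0)        # Step 2: rule "belief <= 1"

def decompose(T):
    U, s, Vt = np.linalg.svd(T)              # Step 3: keep only the
    return s[0] * np.outer(U[:, 0], Vt[0])   # dominant rank-one component

data = np.array([[0.9, 0.1], [0.8, 0.2]])    # assumed empirical beliefs
T = np.zeros_like(data)
for _ in range(20):                          # Steps 1-4, repeated
    T = decompose(falsify_step(induce(T, data)))
```

After a few iterations `T` stabilizes at a rank-one, rule-respecting approximation of the data, which is the fixed-point behavior the conjecture below asks about in general.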
Convergence Conjecture. Under suitable conditions on $\mathcal{R}$ and $\mathcal{D}$, the iteration above converges to a fixed point $\mathcal{T}^*$ that is both statistically faithful to $\mathcal{D}$ and logically consistent with $\mathcal{R}$.
This conjecture is unproven. Conditions for convergence likely require:
- The rule set $\mathcal{R}$ is consistent (no contradictory rules).
- The data $\mathcal{D}$ is not completely at odds with $\mathcal{R}$ (some feasible solution exists).
- The learning rate and falsification schedule are appropriately tuned.
Proving (or disproving) this conjecture is the most important theoretical milestone for PKT.
References
- Bengio, Y., Léonard, N. & Courville, A. (2013). Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv:1308.3432
- Hitchcock, F.L. (1927). The Expression of a Tensor or a Polyadic as a Sum of Products. Journal of Mathematics and Physics, 6(1–4), 164–189.
- Kolda, T.G. & Bader, B.W. (2009). Tensor Decompositions and Applications. SIAM Review, 51(3), 455–500.
Next: Open Questions — the deeper philosophical puzzles this framework raises.