Production-Ready FPolicy Event Pipeline Across 17 UCs — FSx for ONTAP S3 Access Points, Phase 11
May 15, 2026

Originally published by Dev.to

TL;DR

Phase 11 is the production-integration phase: the Phase 10 FPolicy event-ingestion pipeline is now connected to all 17 use-case (UC) templates, with operational guardrails for persistence, deduplication, observability, and future migration to native S3 Access Point (S3AP) notifications.

This is Phase 11 of the FSx for ONTAP S3AP serverless pattern library. Building on Phase 10, Phase 11 delivers:

  • TriggerMode across all 17 UCs: Every UC template now supports POLLING / EVENT_DRIVEN / HYBRID switching via a single CloudFormation parameter
  • UC-specific EventBridge dispatch rules: File path prefix + extension filters route FPolicy events to the correct UC's Step Functions
  • Protobuf format evaluation: Real-world test on ONTAP 9.17.1P6 — confirmed format switching works, discovered TCP framing difference
  • Cross-Account Observability: OAM Sink + Dashboard + SNS + X-Ray deployed and verified
  • Persistent Store: Configured on ONTAP via REST API — closing the tested Fargate restart event-loss window at the configuration layer
  • Idempotency Store: DynamoDB table + checker Lambda for HYBRID mode deduplication
  • FR-2 migration path: Three-phase design for transitioning to S3AP native notifications when available (FR-2 refers to the feature-request track for native bucket-notification-style support on FSx ONTAP S3 Access Points)
  • Production adoption guidance: Rollout/rollback, governance, security guardrails, event payload sensitivity, file-readiness patterns, operational alarms, and Persistent Store sizing

The 17 UCs span compliance, financial document processing (IDP), manufacturing analytics, healthcare imaging, media/VFX, genomics, logistics, retail, autonomous driving, semiconductor EDA, energy/seismic, education/research, defense/satellite, government archives, smart-city geospatial, insurance claims, and construction BIM.

In short: Phase 10 built the shared event-ingestion pipeline. Phase 11 wires it into every UC, adds the operational infrastructure for production (Persistent Store, Idempotency, Observability), and documents the forward migration path. Tests: 435 passed, 3 skipped.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

1. TriggerMode: Three-Mode Integration Across All 17 UCs

The problem

Phase 10 introduced TriggerMode as a reference implementation in UC1 (legal-compliance). The remaining 16 UCs still only supported polling. Operators needed a uniform way to switch any UC between polling, event-driven, and hybrid modes without template surgery.

The solution

Every UC template now includes:

Parameters:
  TriggerMode:
    Type: String
    Default: "POLLING"
    AllowedValues: ["POLLING", "EVENT_DRIVEN", "HYBRID"]

  FPolicyEventBusName:
    Type: String
    Default: "fsxn-fpolicy-events"

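Conditions:
  # Base conditions, one per TriggerMode value
  # (definitions assumed here; the original excerpt omits them)
  IsPolling: !Equals [!Ref TriggerMode, "POLLING"]
  IsEventDriven: !Equals [!Ref TriggerMode, "EVENT_DRIVEN"]
  IsHybrid: !Equals [!Ref TriggerMode, "HYBRID"]
  # Composite conditions referenced by the scheduler and rule resources
  IsPollingOrHybrid:
    !Or [!Condition IsPolling, !Condition IsHybrid]
  IsEventDrivenOrHybrid:
    !Or [!Condition IsEventDriven, !Condition IsHybrid]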

The EventBridge Scheduler and its IAM role use Condition: IsPollingOrHybrid. The FPolicy EventBridge Rule and its IAM role use Condition: IsEventDrivenOrHybrid. Default POLLING means zero impact on existing deployments — the parameter is purely additive.
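
The condition logic above amounts to a small truth table. The following Python sketch is illustrative only: the function name resources_for_mode and the dictionary keys are hypothetical and not part of the repository. It mirrors which condition-gated resource groups would exist for each TriggerMode value.

```python
from typing import Dict

VALID_MODES = ("POLLING", "EVENT_DRIVEN", "HYBRID")


def resources_for_mode(trigger_mode: str) -> Dict[str, bool]:
    """Return which condition-gated resource groups a stack would create."""
    if trigger_mode not in VALID_MODES:
        raise ValueError(f"unsupported TriggerMode: {trigger_mode!r}")
    is_polling = trigger_mode == "POLLING"
    is_event_driven = trigger_mode == "EVENT_DRIVEN"
    is_hybrid = trigger_mode == "HYBRID"
    return {
        # Mirrors Condition: IsPollingOrHybrid (scheduler + its IAM role)
        "scheduler_and_role": is_polling or is_hybrid,
        # Mirrors Condition: IsEventDrivenOrHybrid (FPolicy rule + its IAM role)
        "fpolicy_rule_and_role": is_event_driven or is_hybrid,
    }


if __name__ == "__main__":
    for mode in VALID_MODES:
        print(mode, resources_for_mode(mode))
```

With the default POLLING value only the scheduler group is created, which is why existing deployments are unaffected.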

Validation

  • CloudFormation validate-template: 17/17 PASS
  • cfn_yaml parse: 17/17 PASS
  • SchedulerRole + Schedule condition alignment: 14/14 verified
  • Test suite: 435 passed, 3 skipped, 0 failed

2. UC-Specific EventBridge Dispatch Rules

Architecture

EventBridge Custom Bus (fsxn-fpolicy-events)
  │
  ├── UC1 Rule: prefix=/legal/ OR suffix=.pdf,.docx,.xlsx
  │     → ComplianceStateMachine
  │
  ├── UC2 Rule: prefix=/finance/ OR suffix=.pdf,.tiff,.png,.jpg
  │     → IdpStateMachine
  │
  ├── UC3 Rule: prefix=/manufacturing/ OR suffix=.csv,.json,.parquet
  │     → ManufacturingStateMachine
  │
  │   ... (13 more UCs)
  │
  └── UC17 Rule: prefix=/smartcity/ OR suffix=.geojson,.shp,.tiff,.las
        → DiscoveryFunction (Lambda)

Note: Multiple rules can match the same event; EventBridge fan-out is expected behavior. See the Live E2E verification below.

As the number of UCs grows, routing should be treated as configuration data and used to generate both EventBridge rules and routing tests to prevent drift. The routing definitions documented in docs/guides/fpolicy-uc-routing.md are treated as the source of truth, and scripts/add_eventbridge_rules.py keeps generated EventBridge rules aligned with that routing model.

Each UC's EventBridge Rule filters on:

  • detail.file_path: prefix (directory) and suffix (extension) matchers
  • detail.operation_type: create, write, rename, delete (UC-specific subset)

EventBridge evaluates prefix and suffix within the same array as OR — a file matching any prefix or any suffix triggers the rule. The relationship between operation_type and file_path is AND — both must match.
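
As a rough illustration of that OR/AND structure, the sketch below builds an EventBridge event pattern for one UC. The helper name build_uc_pattern is hypothetical, and the operation list passed in the example is an assumed subset (the article does not list UC1's operations); the prefix and suffix matchers are standard EventBridge content filters.

```python
import json
from typing import Dict, List


def build_uc_pattern(prefixes: List[str], suffixes: List[str],
                     operations: List[str]) -> Dict:
    """Build an EventBridge event pattern for one UC dispatch rule."""
    path_matchers = [{"prefix": p} for p in prefixes]
    path_matchers += [{"suffix": s} for s in suffixes]
    return {
        "detail": {
            # Matchers inside one array are OR-ed: any prefix or any suffix.
            "file_path": path_matchers,
            # Separate fields are AND-ed with each other.
            "operation_type": operations,
        }
    }


if __name__ == "__main__":
    # Prefix and suffixes taken from the UC1 rule in the diagram above;
    # the operation subset is an assumption for illustration.
    pattern = build_uc_pattern(["/legal/"], [".pdf", ".docx", ".xlsx"],
                               ["create", "write"])
    print(json.dumps(pattern, indent=2))
```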

Fan-out behavior

When multiple rules match the same event, EventBridge delivers to all matching targets. This is by design — a .json file in /manufacturing/sensors/ could trigger both UC3 (manufacturing) and UC11 (autonomous-driving) if both monitor .json files. Prefix design should minimize unintended fan-out.

Live E2E verification

We verified dispatch routing by sending test events directly to the custom bus via aws events put-events:

| Test Event | file_path | Matched Rules | Result |
|---|---|---|---|
| verify-legal-01 | /legal/audit/report.pdf | legal-compliance ✅ + financial-idp ✅ | Fan-out: 2 rules matched |
| verify-finance-01 | /finance/contracts/deal.tiff | financial-idp ✅ | 1 rule matched |
| verify-mfg-01 | /manufacturing/iot/sensor-001.json | manufacturing ✅ | 1 rule matched |
| verify-nomatch-01 | /random/path/file.xyz | None | Correctly dropped |

Key finding: /legal/audit/report.pdf matched two rules — the legal-compliance rule (prefix /legal/) AND the financial-idp rule (suffix .pdf). This confirms the OR evaluation within the file_path array and demonstrates fan-out behavior in practice.

Recommendation: The main routing lesson is simple: use path prefix as the primary ownership boundary, and treat suffix filters as supplementary hints. Generic suffixes such as .pdf, .json, and .csv are useful for discovery, but they can create intentional or accidental fan-out across UCs. For strict single-UC routing, rely on prefix alone.
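
The fan-out result and the prefix-only recommendation can be reproduced offline with a small matcher. This is a local simulation, not EventBridge itself: it covers only the three rules shown in the architecture diagram (UC1 to UC3), and the function name matching_rules and the prefix_only switch are hypothetical.

```python
from typing import Dict, List, Tuple

# (prefixes, suffixes) per rule, copied from the diagram for UC1-UC3 only.
RULES: Dict[str, Tuple[List[str], List[str]]] = {
    "legal-compliance": (["/legal/"], [".pdf", ".docx", ".xlsx"]),
    "financial-idp": (["/finance/"], [".pdf", ".tiff", ".png", ".jpg"]),
    "manufacturing": (["/manufacturing/"], [".csv", ".json", ".parquet"]),
}


def matching_rules(file_path: str, prefix_only: bool = False) -> List[str]:
    """Return the rule names whose file_path filter matches.

    With prefix_only=True the suffix matchers are ignored, which models
    strict single-UC routing by path ownership.
    """
    matched = []
    for name, (prefixes, suffixes) in RULES.items():
        by_prefix = any(file_path.startswith(p) for p in prefixes)
        by_suffix = any(file_path.endswith(s) for s in suffixes)
        if by_prefix or (by_suffix and not prefix_only):
            matched.append(name)
    return matched


if __name__ == "__main__":
    print(matching_rules("/legal/audit/report.pdf"))
    print(matching_rules("/legal/audit/report.pdf", prefix_only=True))
```

A table-driven check of this kind could double as a routing regression test, so that adding a generic suffix to one UC does not silently widen fan-out for another.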

Routing documentation

Full routing table with all 17 UCs, their prefixes, suffixes, and target operations is documented in docs/guides/fpolicy-uc-routing.md.

[Figure: EventBridge Custom Bus]

3. Protobuf Format Evaluation

Background

ONTAP 9.15.1+ supports protobuf as an alternative to XML for FPolicy notifications. The theoretical benefits are significant: ~35% message size reduction and faster parsing (with C extensions).

Implementation

Phase 11 delivers a complete protobuf implementation:

  1. Wire-format parser (shared/fpolicy-server/protobuf_parser.py): Pure Python decoder with zero external dependencies. No protobuf package installation required.
  2. Proto schema (shared/fpolicy-server/proto/fpolicy_notification.proto): 14-field FileOperationNotification message definition.
  3. Auto-detection: is_protobuf_format() distinguishes XML from protobuf by inspecting the first byte.
  4. FPolicy Server integration: FPOLICY_FORMAT environment variable switches between xml and protobuf.
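
To give a sense of what a dependency-free wire-format decoder involves, here is a minimal sketch of protobuf field decoding in pure Python. It is not the repository's protobuf_parser.py: the function names are hypothetical, and the field numbers in the test message are arbitrary rather than taken from fpolicy_notification.proto.

```python
from typing import Iterator, Tuple, Union


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7
        if shift >= 64:
            raise ValueError("varint too long")


def iter_fields(buf: bytes) -> Iterator[Tuple[int, int, Union[int, bytes]]]:
    """Yield (field_number, wire_type, value) for each top-level field."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes, sub-messages)
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("truncated length-delimited field")
            value = buf[pos:pos + length]
            pos += length
        elif wire_type in (1, 5):  # fixed 64-bit / 32-bit, kept as raw bytes
            width = 8 if wire_type == 1 else 4
            if pos + width > len(buf):
                raise ValueError("truncated fixed-width field")
            value = buf[pos:pos + width]
            pos += width
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


if __name__ == "__main__":
    # Field 1 = "abc" (length-delimited), field 2 = 150 (varint).
    print(list(iter_fields(b"\x0a\x03abc\x10\x96\x01")))
```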

Benchmark results (1000 events)

| Metric | XML (regex) | protobuf (pure Python) |
|---|---|---|
| Message size (avg) | 220 bytes | 144 bytes |
| Size reduction | - | 34.6% |
| Parse time (1000 events) | 0.15 ms | 0.32 ms |
| Parse speedup | 1.0x (baseline) | 0.47x |

The pure Python protobuf parser is slower than Python's C-optimized regex engine. The real benefit is message size reduction — 34.6% fewer bytes through SQS means lower costs and bandwidth. With the C-compiled protobuf library, parsing speed is expected to improve significantly, but this should be re-benchmarked after the protobuf TCP framing layer is implemented.

Real-world test: TCP framing discovery

We switched the ONTAP FPolicy engine format to protobuf via REST API:

PATCH /api/protocols/fpolicy/{svm}/engines/fpolicy_aws_engine
Body: {"format": "protobuf"}

Result: ONTAP immediately sent protobuf NEGO_REQ messages. However, the FPolicy server logged:

[WARNING] Invalid message length: 53554736

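Analysis: The value 53554736 (0x03312E30) is protobuf field data being misinterpreted as the 4-byte frame length. Its bytes, 03 31 2E 30, read as a length byte of 3 followed by the ASCII string "1.0", which is consistent with a length-delimited protobuf string field rather than a frame header. This reveals that protobuf mode uses different TCP framing than XML mode: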

  • XML mode: a double-quote character, a 4-byte big-endian length, a closing double-quote, then the payload
  • protobuf mode: Different framing (possibly raw protobuf without the quote-delimited wrapper)
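
The misread value can be checked by hand. The short sketch below re-creates the four bytes that the reader consumed as a big-endian length and shows what they contain; only the decimal value comes from the log line, and the variable names are illustrative.

```python
import struct

MISREAD_LENGTH = 53554736  # decimal value from the WARNING log line

# Re-create the four bytes the reader consumed as a big-endian length.
raw = struct.pack(">I", MISREAD_LENGTH)

hex_view = " ".join(f"{b:02X}" for b in raw)
length_byte = raw[0]
text = raw[1:1 + length_byte].decode("ascii")

print(hex_view)           # 03 31 2E 30
print(length_byte, text)  # 3 1.0
```

A 3-byte string "1.0" preceded by its length looks like the body of a length-delimited protobuf field, which supports the reading that the server was parsing message content where it expected a frame header.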

Conclusion: The protobuf field-level parser is validated by the Phase 11 unit tests, and the size-reduction benefit is real. However, the live ONTAP test showed that protobuf mode does not use the same TCP framing path as XML mode. Per NetApp documentation, when the engine format is set to protobuf, "notification messages are encoded in binary form using Google Protobuf" and the FPolicy server must support protobuf deserialization. Phase 12 will focus on confirming the protobuf wire framing with NetApp and adapting the transport reader accordingly.

4. Cross-Account Observability

Deployed resources

| Resource | Purpose |
|---|---|
| OAM Sink | Receives metrics/traces from workload accounts |
| CloudWatch Dashboard | Lambda Invocations/Errors, Step Functions Executions, Processing Latency |
| SNS Topic (KMS-encrypted) | Aggregated alerts from all accounts |
| X-Ray Group | Cross-account trace filtering |
| IAM MetricDeliveryRole | Workload accounts assume this to push metrics |
| IAM TroubleshootingRole | Read-only access for cross-account debugging |
| Log Group | Aggregated log destination |

Single-account limitation

OAM Links cannot be created within the same account (AWS design constraint). The deployment was verified as a single-account simulation per the requirements. A workload-account-oam-link.yaml template is provided for multi-account environments.

Template fix: LogDestination

During deployment, AWS::Logs::Destination failed because it requires a Kinesis Data Stream as target, not a Log Group. This clarified that a CloudWatch Logs destination is not a generic alias for another log group; it is a cross-account subscription destination backed by a supported streaming target such as Kinesis Data Streams or Kinesis Data Firehose. The template was fixed to use Log Group + IAM Role directly, with Kinesis Firehose as an optional future addition for high-volume cross-account log aggregation.

[Figure: CloudWatch Cross-Account Dashboard]

5. Persistent Store: Closing the Restart Event-Loss Window

The problem

With is-mandatory=false, ONTAP drops FPolicy notifications when no server is connected. During Fargate task restarts (~30 seconds), events are lost.

The solution

ONTAP 9.14.1+ Persistent Store queues file access events on the SVM during server disconnection for asynchronous non-mandatory policies. When the external server reconnects, queued events can be replayed. Note that synchronous policies and asynchronous mandatory policies are not supported — Persistent Store is specifically designed for the asynchronous non-mandatory configuration used in this pattern.

Configuration (via Lambda → ONTAP REST API)

Step 1: Create volume (1GB, unix security style)
  POST /api/storage/volumes → 202 Accepted (3s)

Step 2: Create Persistent Store
  POST /api/protocols/fpolicy/{svm}/persistent-stores → 201 Created
  Body: {"name": "fpolicy_aws_store", "volume": "fpolicy_persistent_store"}

Step 3: Attach to policy (disable → attach → re-enable)
  PATCH /api/protocols/fpolicy/{svm}/policies/fpolicy_aws
  Body: {"persistent_store": "fpolicy_aws_store"}

Verification

GET /api/protocols/fpolicy/{svm}/policies/fpolicy_aws?fields=persistent_store,enabled
  → {"enabled": true, "persistent_store": "fpolicy_aws_store"}

ECS task stop → restart test confirmed ONTAP reconnects to the new task within seconds. With Persistent Store configured, events generated during the tested ~30-second Fargate restart window are expected to be queued by ONTAP and replayed after reconnection. Phase 12 will validate this with real NFS/SMB file operations end to end, including verification of replay ordering and completeness under sustained write load.

IP Updater Lambda extension

The IP Updater Lambda was extended with a generic ONTAP API access capability (action: ontap_api). This enables remote ONTAP configuration without a bastion host:

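# AWS CLI v2 treats --payload as base64 by default; this flag lets it accept raw JSON
aws lambda invoke --function-name fsxn-fpolicy-ip-updater \
  --cli-binary-format raw-in-base64-out \
  --payload '{"action": "ontap_api", "method": "GET", "path": "/api/protocols/fpolicy/{svm}/persistent-stores"}' \
  /tmp/result.json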

6. HYBRID Mode Idempotency

The problem

In HYBRID mode, both the EventBridge Scheduler (polling) and the FPolicy EventBridge Rule (event-driven) can trigger processing for the same file. Without deduplication, the same file gets processed twice.

The solution

A DynamoDB-based Idempotency Store with TTL:

Table: fsxn-s3ap-idempotency-store
  pk: "{uc_name}#{file_path}"
  sk: "{operation_type}#{timestamp_bucket}"
  ttl: current_time + 7 days

The timestamp_bucket rounds timestamps to 5-minute windows. Two events for the same file within the same 5-minute window are considered duplicates.
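
A minimal sketch of how such keys could be derived is shown below. It assumes timestamps are rounded down to the start of their window in UTC and formatted to the minute, and that ttl is stored as epoch seconds (which DynamoDB TTL requires); the function names timestamp_bucket and build_item are hypothetical and not the repository's API.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Union


def timestamp_bucket(ts: datetime, window_minutes: int = 5) -> str:
    """Round ts down to the start of its window, e.g. 10:37:42 -> 10:35."""
    if window_minutes <= 0 or 60 % window_minutes != 0:
        raise ValueError("this sketch only supports windows that divide 60")
    ts = ts.astimezone(timezone.utc)
    floored_minute = ts.minute - (ts.minute % window_minutes)
    floored = ts.replace(minute=floored_minute, second=0, microsecond=0)
    return floored.strftime("%Y-%m-%dT%H:%M")


def build_item(uc_name: str, file_path: str, operation_type: str,
               event_time: datetime, now: datetime,
               ttl_days: int = 7) -> Dict[str, Union[str, int]]:
    """Build the idempotency record for one event."""
    return {
        "pk": f"{uc_name}#{file_path}",
        "sk": f"{operation_type}#{timestamp_bucket(event_time)}",
        # DynamoDB TTL attributes are epoch seconds.
        "ttl": int((now + timedelta(days=ttl_days)).timestamp()),
    }


if __name__ == "__main__":
    t = datetime(2026, 5, 15, 10, 37, 42, tzinfo=timezone.utc)
    print(build_item("legal-compliance", "/legal/audit/report.pdf",
                     "create", t, t))
```

Note that fixed windows have an edge effect: two events a few seconds apart can land in different buckets if they straddle a window boundary, so bucket-based deduplication is best-effort near boundaries.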

Step Functions integration

The Idempotency Checker runs as the first step in any UC's Step Functions workflow:

{
  "StartAt": "IdempotencyCheck",
  "States": {
    "IdempotencyCheck": {
      "Type": "Task",
      "Resource": "${IdempotencyCheckerFunction.Arn}",
      "Next": "CheckDuplicate"
    },
    "CheckDuplicate": {
      "Type": "Choice",
      "Choices": [{
        "Variable": "$.idempotency.is_duplicate",
        "BooleanEquals": true,
        "Next": "SkipDuplicate"
      }],
      "Default": "ProcessEvent"
    }
  }
}

Race conditions are handled via DynamoDB conditional writes (attribute_not_exists(pk)). If two executions race, only one succeeds — the other gets ConditionalCheckFailedException and skips.
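
The conditional-write pattern can be sketched as follows. This is not the repository's checker Lambda: claim_event is a hypothetical helper, and table is assumed to be a boto3 DynamoDB Table resource (for example boto3.resource("dynamodb").Table("fsxn-s3ap-idempotency-store")). The error is inspected through the response attribute that botocore's ClientError carries, so the sketch itself has no import-time dependency on boto3.

```python
from typing import Any


def claim_event(table: Any, pk: str, sk: str, ttl_epoch: int) -> bool:
    """Try to record an event; return True if this caller won the race.

    Returns False when the item already exists (duplicate event).
    Any other error is re-raised unchanged.
    """
    try:
        table.put_item(
            Item={"pk": pk, "sk": sk, "ttl": ttl_epoch},
            # The write succeeds only if no item with this key exists yet.
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True
    except Exception as exc:
        error = getattr(exc, "response", {}).get("Error", {})
        if error.get("Code") == "ConditionalCheckFailedException":
            return False
        raise
```

A checker built this way would set is_duplicate to the negation of the return value, which is the field the Choice state above inspects.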

Tuning considerations

The 5-minute bucket is intentionally conservative for HYBRID-mode deduplication. UCs that require multiple legitimate updates to the same file within a short interval can tune the bucket size via the DEDUP_WINDOW_MINUTES environment variable, or include an additional event attribute (such as file size or ONTAP event sequence information) in the sort key to distinguish genuinely distinct events from duplicates.

[Figure: DynamoDB Idempotency Store]

Live E2E verification

Verified the deduplication mechanism directly against the deployed DynamoDB table:

1st PutItem (pk=legal-compliance#/legal/audit/report.pdf, sk=create#2026-05-15T10:35):
  → Success (new record created)

2nd PutItem (same key, condition: attribute_not_exists(pk)):
  → ConditionalCheckFailedException ✅ (duplicate detected)

This proves the table-level duplicate rejection mechanism used by HYBRID mode. When the Idempotency Checker is the first Step Functions task, the second execution can be rejected before downstream processing starts.

7. FR-2 Migration Path

If/when native S3AP notifications become available through the FR-2 track, the migration is designed to be parameter-change-only for UCs that do not depend on FPolicy-only fields:

| Phase | TriggerMode | FPolicy | S3AP Notifications |
|---|---|---|---|
| A (parallel) | HYBRID | Active | Active |
| B (cutover) | EVENT_DRIVEN | Disabled | Active |
| C (cleanup) | EVENT_DRIVEN | Removed | Active |

Schema compatibility challenges

| FPolicy field | S3AP equivalent | Gap |
|---|---|---|
| user_name | N/A | S3AP may not include NTFS user info |
| operation_type: rename | N/A | S3 events don't have rename |
| protocol | Always "s3" | Loss of NFS/SMB distinction |

UCs that depend on user_name (permission-aware scenarios) may need to maintain FPolicy even after FR-2 GA.

Full migration path documented in docs/guides/fr2-migration-path.md.

8. Test Results

| Category | Count | Result |
|---|---|---|
| Existing tests (Phase 1-10) | 391 | All PASS ✅ |
| protobuf parser tests | 18 | All PASS ✅ |
| Idempotency checker tests | 10 | All PASS ✅ |
| FPolicy engine tests | 16 | All PASS ✅ |
| Skipped (handler refactored) | 3 | Expected ⏭️ |
| Total | 435 + 3 skipped | All PASS |

CloudFormation validation

| Method | Result |
|---|---|
| cfn_yaml parse (all 17 UCs) | 17/17 PASS |
| aws cloudformation validate-template | 17/17 PASS |
| shared templates (observability, idempotency, OAM link) | 4/4 PASS |

9. Deployed AWS Resources

| Stack | Resources |
|---|---|
| fsxn-shared-observability | OAM Sink, Dashboard, SNS, X-Ray Group, IAM Roles |
| fsxn-idempotency-store | DynamoDB (PAY_PER_REQUEST, TTL, PITR) |
| fsxn-fpolicy-routing | EventBridge Bus, Bridge Lambda, Idempotency Table |
| fsxn-fp-srv | ECS Fargate Cluster, FPolicy Server Service |
| fsxn-fpolicy-ingestion | SQS Queue, DLQ, IP Updater Lambda |

ONTAP resources

| Resource | Status |
|---|---|
| FPolicy policy fpolicy_aws | Enabled, persistent_store attached |
| Persistent Store fpolicy_aws_store | Active (1GB volume) |
| Engine format | XML (protobuf tested, reverted due to framing) |

Post-deployment health check (2026-05-15)

| Component | Status | Detail |
|---|---|---|
| FPolicy Server (ECS Fargate) | ✅ Running | ONTAP connecting every 10s |
| SQS Ingestion Queue | ✅ Empty (0/0/0) | No stuck messages |
| FPolicy Policy | ✅ Enabled | persistent_store + engine attached |
| DynamoDB Idempotency | ✅ Active | TTL enabled, PITR on |
| SNS Alerts | ⚠️ PendingConfirmation | Email subscription awaiting confirmation |
| EventBridge Custom Bus | ✅ Operational | Dispatch routing verified via put-events |

[Figure: CloudFormation Stacks]
[Figure: ECS FPolicy Server]

10. Deployment Learnings

| Issue | Root Cause | Fix |
|---|---|---|
| validate-template fails for autonomous-driving | Template exceeds 51,200 byte inline limit | Use S3 URL for validation; added CI job |
| AWS::Logs::Destination creation fails | Requires Kinesis target, not Log Group | Removed LogDestination, use Log Group directly |
| OAM Link same-account error | AWS design: links only work cross-account | Documented; provided workload-account template |
| SchedulerRole created in EVENT_DRIVEN mode | Missing Condition on SchedulerRole | Added Condition: IsPollingOrHybrid to 14 templates |
| protobuf messages rejected as invalid length | Different TCP framing in protobuf mode | Documented; XML mode maintained for stability |
| test_fpolicy_engine import errors | Handler refactored to IP Updater | Added missing exports; skipped 3 legacy tests |
| Persistent Store autoflush_enabled rejected | Parameter name not supported in REST API | Removed; ONTAP uses defaults |
| Policy modification while enabled | ONTAP rejects PATCH on enabled policy | Disable → modify → re-enable sequence |
| .pdf suffix causes multi-UC fan-out | EventBridge OR evaluation within array | Document: use prefix as primary filter |
| EventBridge → CloudWatch Logs delivery fails | Missing resource policy on log group | Added logs:PutLogEvents permission for events.amazonaws.com |

11. Production Adoption Guidance

Recommended rollout model

TriggerMode is not just a template parameter — it is the operational rollout and rollback control for event-driven adoption. A detailed guide with rollback criteria, UC classification, and CloudFormation behavior matrix is available in docs/guides/triggermode-rollout.md. The summary:

  1. Start with POLLING for all UCs to preserve existing behavior.
  2. Enable the shared FPolicy ingestion pipeline and validate EventBridge routing with put-events.
  3. Move one low-risk UC to HYBRID and observe duplicate rate, Step Functions success rate, and SQS backlog.
  4. Move latency-sensitive UCs to EVENT_DRIVEN after routing and idempotency validation.
  5. Keep compliance-sensitive UCs in HYBRID until Persistent Store replay is validated end to end.

Rollback: At any stage, reverting TriggerMode to the previous value via CloudFormation stack update restores the CloudFormation-managed resources for the prior mode. Operators should wait for stack update completion and verify scheduler/rule state, SQS backlog, and Step Functions executions before declaring rollback complete. The sequence is always EVENT_DRIVEN → HYBRID → POLLING (never skip HYBRID when rolling back from EVENT_DRIVEN in production).
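
The rollback ordering can be expressed as a small helper that a deployment pipeline might use to plan stack updates. This is an illustrative sketch; rollback_steps is a hypothetical name and not part of the repository.

```python
from typing import List

# Rollback order described above: never skip HYBRID on the way down.
ROLLBACK_ORDER = ["EVENT_DRIVEN", "HYBRID", "POLLING"]


def rollback_steps(current: str, target: str = "POLLING") -> List[str]:
    """Return the TriggerMode values to apply, one stack update each."""
    if current not in ROLLBACK_ORDER or target not in ROLLBACK_ORDER:
        raise ValueError("unknown TriggerMode")
    start = ROLLBACK_ORDER.index(current)
    end = ROLLBACK_ORDER.index(target)
    if end < start:
        raise ValueError(f"{target} is not a rollback target from {current}")
    return ROLLBACK_ORDER[start + 1:end + 1]


if __name__ == "__main__":
    print(rollback_steps("EVENT_DRIVEN"))
```

Each returned value corresponds to one stack update, after which the verification checks described above would run before the next step.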

Security guardrails for ONTAP API automation

The ontap_api action is intended for controlled operations automation, not as an unrestricted ONTAP proxy. The handler implementation (shared/lambdas/fpolicy_engine/handler.py) enforces:

  • Path allowlist: Only /api/protocols/fpolicy/, /api/storage/volumes, /api/storage/aggregates, and /api/cluster/jobs/ are permitted. All other paths return HTTP 403.
  • DELETE method restriction: Disabled by default. Requires explicit ONTAP_API_ALLOW_DELETE=true environment variable to enable.
  • Log redaction: Only method and path are logged — request bodies containing credentials are never written to CloudWatch Logs.
  • Structured audit log: Each invocation emits a structured log line with method, path, status, correlation_id, and request timestamp. Caller identity can be correlated via CloudTrail Lambda Invoke events without logging sensitive request/response bodies.
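
The first two guardrails could look roughly like the sketch below. It is not the code in handler.py: the authorize function is hypothetical, returning 403 for a disabled DELETE and rejecting ".." path segments are assumptions of this sketch, and only the allowlisted prefixes and the ONTAP_API_ALLOW_DELETE variable come from the description above.

```python
import os
from typing import Mapping, Optional

# Allowlisted path prefixes, as listed above.
ALLOWED_PATH_PREFIXES = (
    "/api/protocols/fpolicy/",
    "/api/storage/volumes",
    "/api/storage/aggregates",
    "/api/cluster/jobs/",
)


def authorize(method: str, path: str,
              env: Optional[Mapping[str, str]] = None) -> int:
    """Return 200 if the request may be forwarded to ONTAP, else 403."""
    env = os.environ if env is None else env
    route = path.split("?", 1)[0]  # ignore the query string when matching
    # Assumption: reject parent-directory segments so an allowlisted
    # prefix cannot be used to reach other API trees.
    if ".." in route.split("/"):
        return 403
    if not any(route.startswith(prefix) for prefix in ALLOWED_PATH_PREFIXES):
        return 403
    delete_allowed = env.get("ONTAP_API_ALLOW_DELETE", "").lower() == "true"
    if method.upper() == "DELETE" and not delete_allowed:
        return 403
    return 200
```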

Production deployments should additionally restrict Lambda invoke permissions to deployment automation roles only, and store ONTAP credentials in Secrets Manager with rotation planning.

Pass correlation_id in the event payload to trace ONTAP API operations across deployment automation, Lambda logs, and operational runbooks.

MSP and multi-customer naming

For MSP or multi-customer deployments, parameterize shared resource names with CustomerId, EnvironmentName, and Region to avoid cross-tenant naming collisions — for example: {customer}-{env}-fsxn-fpolicy-events and {customer}-{env}-s3ap-idempotency-store. Full naming guidance with CloudFormation examples is in docs/guides/triggermode-rollout.md.

TriggerMode governance

For enterprise rollout, treat TriggerMode as a governed operational control. Changes from POLLING to HYBRID or EVENT_DRIVEN should be reviewed with routing test results, idempotency validation, alarm readiness, and rollback owner assignment. Track TriggerMode changes through your change management process (Change Manager, GitOps PR, or deployment pipeline logs) — not just CloudFormation stack events.

Event payload sensitivity

For public-sector or regulated workloads, file paths and FPolicy metadata should be treated as potentially sensitive data. In regulated environments, metadata is data — file paths, user names, and protocol context should be classified before being forwarded to cross-account observability systems. Production deployments should define which event fields are logged, masked, hashed, or excluded before forwarding to cross-account observability or long-term audit storage. A data classification guide is available in docs/guides/data-classification.md.

For regulated workloads, duplicate suppression should not mean audit disappearance; skipped duplicate events should still be recorded with correlation IDs and deduplication decisions. See docs/guides/compliance-audit-ledger.md for the audit ledger design.

File readiness for event-driven pipelines

For large files, an FPolicy create or write event may arrive before the file write is complete — particularly with NFSv3 which lacks close semantics. UCs that process large analytics, imaging, geospatial, or EDA files should combine event-driven triggering with a readiness strategy:

  • Rename-based commit: Write to a temporary path, rename to final path on completion. Process only rename events.
  • Marker file: Write a .done or _SUCCESS marker after the primary file is complete. Trigger on marker creation.
  • Size-stability check: Poll file size at N-second intervals; start processing when size is stable across two consecutive checks.

The existing WRITE_COMPLETE_DELAY_SEC (default 5s) in the FPolicy server provides a basic delay, but is insufficient for multi-GB files. A fixed delay should be treated as a fallback, not a correctness guarantee. The new UC checklist (docs/guides/new-uc-checklist.md) includes file readiness as a required design decision for large-file UCs.
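
Of the three strategies, the size-stability check is the easiest to sketch in a few lines. The helper below is illustrative (wait_until_stable and its parameters are hypothetical, not repository code); the size getter and sleep function are injectable so the logic can be exercised without a real file or real waiting.

```python
import os
import time
from typing import Callable


def wait_until_stable(path: str,
                      interval_sec: float = 5.0,
                      max_checks: int = 60,
                      get_size: Callable[[str], int] = os.path.getsize,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once two consecutive size readings are equal.

    Returns False if the size is still changing after max_checks polls,
    so the caller can retry later or raise an alert.
    """
    previous = get_size(path)
    for _ in range(max_checks):
        sleep(interval_sec)
        current = get_size(path)
        if current == previous:
            return True
        previous = current
    return False
```

A stable size is a heuristic rather than proof of completion: a writer that pauses for longer than the polling interval would look finished. Where the producer can cooperate, rename-based commit or a marker file gives a stronger signal.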

Recommended operational alarms

A ready-to-deploy CloudFormation template (shared/cfn/recommended-alarms.yaml) defines the following alarms. Severity labels are examples and should be mapped to each organization's incident classification model.

| Metric | Condition | Severity |
|---|---|---|
| SQS ApproximateAgeOfOldestMessage | > 300 seconds for 5 minutes | SEV2 |
| SQS DLQ ApproximateNumberOfMessagesVisible | > 0 | SEV2 |
| Step Functions ExecutionsFailed | > 0 for critical production UCs | SEV2 |
| ECS RunningTaskCount | < DesiredTaskCount for > 60 seconds | SEV1 |
| DynamoDB ThrottledRequests | > 0 | SEV3 |

The ECS desired-vs-running alarm may require Container Insights, metric math, or a custom service health metric depending on how ECS service metrics are emitted in the target account. For high-volume batch UCs, failure-rate-based alarms may be less noisy than absolute failure-count alarms.

Deploy as a standalone monitoring stack or integrate into each UC template's EnableCloudWatchAlarms section.

Initial SLO candidates

While formal SLO definition is a Phase 12 deliverable, the following targets serve as initial operational guidance:

  • 99% of events delivered to SQS within 60 seconds under normal load
  • FPolicy server reconnect within 60 seconds after ECS task replacement
  • SQS backlog recovered within 5 minutes after planned maintenance
  • Step Functions start latency under 2 minutes for EVENT_DRIVEN UCs

Persistent Store sizing

For environments requiring Persistent Store, size the volume based on expected outage duration:

required_size = event_rate_per_sec × max_outage_duration_sec × avg_event_size_bytes × safety_factor

Example: 100 events/sec × 300s outage × 500 bytes × 2.0 safety ≈ 30 MB required (15 MB of raw event data, doubled by the safety factor). The 1 GB volume configured in Phase 11 provides room for roughly 2 million 500-byte event records before applying operational safety margin; with a 2.0 safety factor, treat the practical planning capacity as closer to 1 million events. High-frequency environments (1000+ events/sec) should increase the volume size proportionally and validate replay rate after reconnection.
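
The sizing arithmetic can be captured in two small functions for planning spreadsheets or CI checks. The function names are hypothetical, and the calculation deliberately ignores any ONTAP-side metadata or file system overhead, which is an assumption of this sketch.

```python
def required_store_bytes(event_rate_per_sec: float,
                         max_outage_sec: float,
                         avg_event_bytes: float,
                         safety_factor: float = 2.0) -> float:
    """Apply the sizing formula above and return bytes."""
    return (event_rate_per_sec * max_outage_sec
            * avg_event_bytes * safety_factor)


def event_capacity(volume_bytes: float,
                   avg_event_bytes: float,
                   safety_factor: float = 1.0) -> int:
    """Number of events a volume can hold at the given safety factor."""
    return int(volume_bytes / (avg_event_bytes * safety_factor))


if __name__ == "__main__":
    print(required_store_bytes(100, 300, 500))   # 30000000.0 (30 MB)
    print(event_capacity(10**9, 500))            # 2000000
    print(event_capacity(10**9, 500, 2.0))       # 1000000
```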

Full sizing table with scenario-based estimates is available in docs/event-driven/fpolicy-persistent-store.md.

12. Next Phase Outlook

Phase 11 completes the event-driven integration layer. Remaining work for Phase 12:

Protocol and replay validation

  1. protobuf TCP framing: Consult NetApp support on protobuf wire format; adapt read_fpolicy_message() for frameless protobuf
  2. Persistent Store replay E2E validation: NFS/SMB file creation during Fargate restart → verify that queued events are replayed and delivered to SQS without loss
  3. Replay storm testing: Generate events during FPolicy server downtime, reconnect, measure replay duration, SQS ingestion rate, Step Functions concurrency, and whether downstream throttling occurs

Scale and operations

  1. High-load testing: 1000+ events/sec stress test with Fargate scaling
  2. SLO definition: Define event ingestion latency, processing success rate, reconnect time, and replay completion time targets
  3. Multi-account OAM Link: Deploy workload-account-oam-link.yaml in a second account

Production rollout

  1. Production UC deployment: Deploy a UC template with TriggerMode=EVENT_DRIVEN and verify end-to-end file operation → Step Functions execution

Already verified in Phase 11 (no longer Phase 12 candidates):

  • ✅ EventBridge dispatch routing (put-events → rule matching → CloudWatch Logs)
  • ✅ Idempotency Store deduplication (conditional write rejection)
  • ✅ Persistent Store configuration (ONTAP REST API)
  • ✅ ECS task restart + ONTAP reconnection

Who should care about Phase 11?

  • Platform teams can now switch any UC between polling and event-driven with a single parameter change — no template surgery required
  • Operations / SRE teams get Cross-Account Observability with a pre-built dashboard, recommended alarm thresholds, and a rollout/rollback model
  • Compliance teams get Persistent Store support to close the tested Fargate restart event-loss window, with full replay validation planned for Phase 12
  • Security teams get documented guardrails for the ONTAP API automation path, including allowlist, audit recommendations, and event payload sensitivity guidance
  • Architecture teams get a documented FR-2 migration path — if/when native S3AP notifications become available, the transition is a parameter change for compatible UCs
  • Data engineering teams get file-readiness guidance for large-file analytics pipelines where event arrival precedes write completion
  • MSPs and partners get cross-account templates, tenant-aware naming guidance, and a standardized TriggerMode control for multi-customer deployments
  • Performance engineers get protobuf evaluation data (34.6% size reduction) and a clear path to enabling it once TCP framing is resolved
  • DevOps teams get CI-integrated template validation (cfn_yaml + validate-template) catching issues before deployment

Conclusion

Phase 11 transforms the FPolicy event-driven pipeline from a single-UC reference implementation into a production-ready, 17-UC integrated system. TriggerMode is not just a template parameter — it is the operational rollout and rollback control for event-driven adoption, enabling platform teams to move individual UCs through POLLING → HYBRID → EVENT_DRIVEN at their own pace.

UC-specific EventBridge rules handle routing complexity through path-prefix ownership boundaries, while the Idempotency Store prevents duplicate processing in HYBRID mode. Persistent Store closes the known Fargate restart event-loss window at the ONTAP configuration layer, while Phase 12 will validate replay completeness with real NFS/SMB file operations.

The protobuf evaluation yielded a valuable real-world finding: ONTAP uses different TCP framing for protobuf messages than for XML. The field-level parser is validated against test fixtures, but the transport layer needs adaptation — a focused Phase 12 task requiring NetApp consultation rather than a blocker.

With 435 passing tests, 17 validated templates, 5 deployed CloudFormation stacks, production adoption guidance (rollout model, governance, security guardrails, event payload sensitivity, file readiness, alarm thresholds, Persistent Store sizing), and comprehensive documentation, Phase 11 delivers the operational maturity needed for enterprise-grade event-driven file workflows on FSx for ONTAP.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
Previous phases: Phase 1 · Phase 7 · Phase 8 · Phase 9 · Phase 10
