Production-Ready FPolicy Event Pipeline Across 17 UCs: FSx for ONTAP S3 Access Points, Phase 11

TL;DR

Phase 11 is the production-integration phase: the Phase 10 FPolicy event-ingestion pipeline is now connected to all 17 use-case (UC) templates, with operational guardrails for persistence, deduplication, observability, and future migration to native S3 Access Point (S3AP) notifications.

This is Phase 11 of the FSx for ONTAP S3AP serverless pattern library. Building on Phase 10, Phase 11 delivers:

TriggerMode across all 17 UCs: Every UC template now supports POLLING / EVENT_DRIVEN / HYBRID switching via a single CloudFormation parameter
UC-specific EventBridge dispatch rules: File path prefix + extension filters route FPolicy events to the correct UC's Step Functions
Protobuf format evaluation: Real-world test on ONTAP 9.17.1P6 — confirmed format switching works, discovered TCP framing difference
Cross-Account Observability: OAM Sink + Dashboard + SNS + X-Ray deployed and verified
Persistent Store: Configured on ONTAP via REST API — closing the tested Fargate restart event-loss window at the configuration layer
Idempotency Store: DynamoDB table + checker Lambda for HYBRID mode deduplication
FR-2 migration path: Three-phase design for transitioning to S3AP native notifications when available (FR-2 refers to the feature-request track for native bucket-notification-style support on FSx ONTAP S3 Access Points)
Production adoption guidance: Rollout/rollback, governance, security guardrails, event payload sensitivity, file-readiness patterns, operational alarms, and Persistent Store sizing

The 17 UCs span compliance, financial document processing (IDP), manufacturing analytics, healthcare imaging, media/VFX, genomics, logistics, retail, autonomous driving, semiconductor EDA, energy/seismic, education/research, defense/satellite, government archives, smart-city geospatial, insurance claims, and construction BIM.

In short: Phase 10 built the shared event-ingestion pipeline. Phase 11 wires it into every UC, adds the operational infrastructure for production (Persistent Store, Idempotency, Observability), and documents the forward migration path. Tests: 435 passed, 3 skipped.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

1. TriggerMode: Three-Mode Integration Across All 17 UCs

The problem

Phase 10 introduced TriggerMode as a reference implementation in UC1 (legal-compliance). The remaining 16 UCs still only supported polling. Operators needed a uniform way to switch any UC between polling, event-driven, and hybrid modes without template surgery.

The solution

Every UC template now includes:

Parameters:
  TriggerMode:
    Type: String
    Default: "POLLING"
    AllowedValues: ["POLLING", "EVENT_DRIVEN", "HYBRID"]

  FPolicyEventBusName:
    Type: String
    Default: "fsxn-fpolicy-events"

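Conditions:
  # Base conditions referenced below (assumed to compare TriggerMode directly)
  IsPolling: !Equals [!Ref TriggerMode, "POLLING"]
  IsEventDriven: !Equals [!Ref TriggerMode, "EVENT_DRIVEN"]
  IsHybrid: !Equals [!Ref TriggerMode, "HYBRID"]
  IsPollingOrHybrid:
    !Or [!Condition IsPolling, !Condition IsHybrid]
  IsEventDrivenOrHybrid:
    !Or [!Condition IsEventDriven, !Condition IsHybrid]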

The EventBridge Scheduler and its IAM role use Condition: IsPollingOrHybrid. The FPolicy EventBridge Rule and its IAM role use Condition: IsEventDrivenOrHybrid. Default POLLING means zero impact on existing deployments — the parameter is purely additive.
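
To make the condition logic concrete, here is a minimal Python sketch that mirrors the two combined conditions as booleans. The `TriggerMode` enum and the `active_triggers` helper are illustrative names and do not come from the repository.

```python
from enum import Enum


class TriggerMode(str, Enum):
    POLLING = "POLLING"
    EVENT_DRIVEN = "EVENT_DRIVEN"
    HYBRID = "HYBRID"


def active_triggers(mode: TriggerMode) -> dict:
    """Mirror IsPollingOrHybrid and IsEventDrivenOrHybrid as plain booleans."""
    is_polling_or_hybrid = mode in (TriggerMode.POLLING, TriggerMode.HYBRID)
    is_event_driven_or_hybrid = mode in (TriggerMode.EVENT_DRIVEN, TriggerMode.HYBRID)
    return {
        "scheduler": is_polling_or_hybrid,          # EventBridge Scheduler + IAM role
        "fpolicy_rule": is_event_driven_or_hybrid,  # FPolicy EventBridge Rule + IAM role
    }


for mode in TriggerMode:
    print(mode.value, active_triggers(mode))
```

Only HYBRID enables both trigger paths, which is why it is the one mode that needs deduplication.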

Validation

CloudFormation validate-template: 17/17 PASS
cfn_yaml parse: 17/17 PASS
SchedulerRole + Schedule condition alignment: 14/14 verified
Test suite: 435 passed, 3 skipped, 0 failed

2. UC-Specific EventBridge Dispatch Rules

Architecture

EventBridge Custom Bus (fsxn-fpolicy-events)
  │
  ├── UC1 Rule: prefix=/legal/ OR suffix=.pdf,.docx,.xlsx
  │     → ComplianceStateMachine
  │
  ├── UC2 Rule: prefix=/finance/ OR suffix=.pdf,.tiff,.png,.jpg
  │     → IdpStateMachine
  │
  ├── UC3 Rule: prefix=/manufacturing/ OR suffix=.csv,.json,.parquet
  │     → ManufacturingStateMachine
  │
  │   ... (14 more UCs)
  │
  └── UC17 Rule: prefix=/smartcity/ OR suffix=.geojson,.shp,.tiff,.las
        → DiscoveryFunction (Lambda)

Note: Multiple rules can match the same event; EventBridge fan-out is expected behavior. See the Live E2E verification below.

As the number of UCs grows, routing should be treated as configuration data and used to generate both EventBridge rules and routing tests to prevent drift. The routing definitions documented in docs/guides/fpolicy-uc-routing.md are treated as the source of truth, and scripts/add_eventbridge_rules.py keeps generated EventBridge rules aligned with that routing model.

Each UC's EventBridge Rule filters on:

detail.file_path: prefix (directory) and suffix (extension) matchers
detail.operation_type: create, write, rename, delete (UC-specific subset)

EventBridge evaluates prefix and suffix within the same array as OR — a file matching any prefix or any suffix triggers the rule. The relationship between operation_type and file_path is AND — both must match.
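
As an illustration of the OR/AND semantics just described, the sketch below builds an event pattern of that shape. `build_event_pattern` is a hypothetical helper, and the operation list passed in is an example rather than UC1's actual configuration; the repository generates its rules with scripts/add_eventbridge_rules.py, which may work differently.

```python
import json


def build_event_pattern(prefixes, suffixes, operations):
    """Build an EventBridge event pattern for one UC.

    Matchers inside the file_path array are ORed together; file_path and
    operation_type are separate keys, so both must match (AND).
    """
    file_path_matchers = [{"prefix": p} for p in prefixes]
    file_path_matchers += [{"suffix": s} for s in suffixes]
    return {
        "detail": {
            "file_path": file_path_matchers,
            "operation_type": list(operations),
        }
    }


pattern = build_event_pattern(
    prefixes=["/legal/"],
    suffixes=[".pdf", ".docx", ".xlsx"],
    operations=["create", "write"],  # illustrative subset only
)
print(json.dumps(pattern, indent=2))
```

Dropping the suffix matchers from the array turns the rule into a strict prefix-only filter.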

Fan-out behavior

When multiple rules match the same event, EventBridge delivers to all matching targets. This is by design — a .json file in /manufacturing/sensors/ could trigger both UC3 (manufacturing) and UC11 (autonomous-driving) if both monitor .json files. Prefix design should minimize unintended fan-out.

Live E2E verification

We verified dispatch routing by sending test events directly to the custom bus via aws events put-events:

Test Event	file_path	Matched Rules	Result
`verify-legal-01`	`/legal/audit/report.pdf`	legal-compliance ✅ + financial-idp ✅	Fan-out: 2 rules matched
`verify-finance-01`	`/finance/contracts/deal.tiff`	financial-idp ✅	1 rule matched
`verify-mfg-01`	`/manufacturing/iot/sensor-001.json`	manufacturing ✅	1 rule matched
`verify-nomatch-01`	`/random/path/file.xyz`	None	Correctly dropped

Key finding: /legal/audit/report.pdf matched two rules — the legal-compliance rule (prefix /legal/) AND the financial-idp rule (suffix .pdf). This confirms the OR evaluation within the file_path array and demonstrates fan-out behavior in practice.

Recommendation: The main routing lesson is simple: use path prefix as the primary ownership boundary, and treat suffix filters as supplementary hints. Generic suffixes such as .pdf, .json, and .csv are useful for discovery, but they can create intentional or accidental fan-out across UCs. For strict single-UC routing, rely on prefix alone.
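
The fan-out in the table can be reproduced offline with a small matcher. This sketch models only the three rules spelled out in the architecture diagram (UC1 to UC3), ignores operation_type, and uses a hypothetical `matched_rules` helper; it approximates EventBridge prefix/suffix matching rather than calling the service.

```python
# Only the three rules shown in the architecture diagram are modeled here.
RULES = {
    "legal-compliance": {"prefixes": ["/legal/"], "suffixes": [".pdf", ".docx", ".xlsx"]},
    "financial-idp": {"prefixes": ["/finance/"], "suffixes": [".pdf", ".tiff", ".png", ".jpg"]},
    "manufacturing": {"prefixes": ["/manufacturing/"], "suffixes": [".csv", ".json", ".parquet"]},
}


def matched_rules(file_path: str, prefix_only: bool = False) -> list:
    """Return the names of rules whose prefix OR suffix matchers hit the path."""
    hits = []
    for name, rule in RULES.items():
        by_prefix = any(file_path.startswith(p) for p in rule["prefixes"])
        by_suffix = any(file_path.endswith(s) for s in rule["suffixes"])
        if by_prefix or (by_suffix and not prefix_only):
            hits.append(name)
    return hits


for path in (
    "/legal/audit/report.pdf",
    "/finance/contracts/deal.tiff",
    "/manufacturing/iot/sensor-001.json",
    "/random/path/file.xyz",
):
    print(path, matched_rules(path), matched_rules(path, prefix_only=True))
```

With `prefix_only=True` the legal PDF routes to a single UC, which is the strict-ownership behavior the recommendation describes.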

Routing documentation

Full routing table with all 17 UCs, their prefixes, suffixes, and target operations is documented in docs/guides/fpolicy-uc-routing.md.

3. Protobuf Format Evaluation

Background

ONTAP 9.15.1+ supports protobuf as an alternative to XML for FPolicy notifications. The theoretical benefits are significant: ~35% message size reduction and faster parsing (with C extensions).

Implementation

Phase 11 delivers a complete protobuf implementation:

Wire-format parser (shared/fpolicy-server/protobuf_parser.py): Pure Python decoder with zero external dependencies. No protobuf package installation required.
Proto schema (shared/fpolicy-server/proto/fpolicy_notification.proto): 14-field FileOperationNotification message definition.
Auto-detection: is_protobuf_format() distinguishes XML from protobuf by inspecting the first byte.
FPolicy Server integration: FPOLICY_FORMAT environment variable switches between xml and protobuf.
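
For readers unfamiliar with the protobuf wire format, the following is a minimal pure-Python decoder for the two most common wire types (varint and length-delimited). It is a generic sketch of the encoding, not the repository's protobuf_parser.py, and the sample bytes are hand-built rather than captured from ONTAP.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7


def decode_fields(buf: bytes) -> list:
    """Return (field_number, wire_type, value) tuples for a flat message."""
    fields = []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited: strings, bytes, nested messages
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


# Hand-built sample: field 1 = varint 150, field 2 = string "testing".
sample = bytes([0x08, 0x96, 0x01, 0x12, 0x07]) + b"testing"
print(decode_fields(sample))
```

A decoder in this style needs no generated classes, at the cost of mapping field numbers to names by hand.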

Benchmark results (1000 events)

Metric	XML (regex)	protobuf (pure Python)
Message size (avg)	220 bytes	144 bytes
Size reduction	—	34.6%
Parse time (1000 events)	0.15 ms	0.32 ms
Parse speedup	1.0x (baseline)	0.47x

The pure Python protobuf parser is slower than Python's C-optimized regex engine. The real benefit is message size reduction — 34.6% fewer bytes through SQS means lower costs and bandwidth. With the C-compiled protobuf library, parsing speed is expected to improve significantly, but this should be re-benchmarked after the protobuf TCP framing layer is implemented.

Real-world test: TCP framing discovery

We switched the ONTAP FPolicy engine format to protobuf via REST API:

PATCH /api/protocols/fpolicy/{svm}/engines/fpolicy_aws_engine
Body: {"format": "protobuf"}

Result: ONTAP immediately sent protobuf NEGO_REQ messages. However, the FPolicy server logged:

[WARNING] Invalid message length: 53554736

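Analysis: The value 53554736 (0x03312E30) is protobuf field data being misinterpreted as the 4-byte frame length. Read as bytes, it is 0x03 followed by the ASCII characters "1.0", which is what a length-delimited protobuf string field of length 3 would look like. This reveals that protobuf mode uses different TCP framing than XML mode: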

XML mode: " + 4-byte big-endian length + " + payload
protobuf mode: Different framing (possibly raw protobuf without the quote-delimited wrapper)
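
The misread length can be examined with a few lines of Python. The sketch below frames and parses a payload using the XML-mode layout listed above, then packs the logged value back into the four bytes the reader consumed. `frame_xml` and `read_xml_frame` are illustrative names, not the FPolicy server's functions.

```python
import struct

QUOTE = b'"'


def frame_xml(payload: bytes) -> bytes:
    """Frame a payload as: quote + 4-byte big-endian length + quote + payload."""
    return QUOTE + struct.pack(">I", len(payload)) + QUOTE + payload


def read_xml_frame(data: bytes) -> bytes:
    """Parse one XML-mode frame; raise if the quote delimiters are missing."""
    if data[0:1] != QUOTE or data[5:6] != QUOTE:
        raise ValueError("not an XML-mode frame")
    (length,) = struct.unpack(">I", data[1:5])
    return data[6:6 + length]


# Round-trip a small payload through the XML-mode framing.
assert read_xml_frame(frame_xml(b"<x/>")) == b"<x/>"

# Reinterpret the logged "length" as the four bytes that were actually read.
raw = struct.pack(">I", 53554736)
print(raw.hex(), raw)
```

Running it prints the hex form of those four bytes alongside their raw representation, which makes it easy to compare against a packet capture.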

Conclusion: The protobuf field-level parser is validated by the Phase 11 unit tests, and the size-reduction benefit is real. However, the live ONTAP test showed that protobuf mode does not use the same TCP framing path as XML mode. Per NetApp documentation, when the engine format is set to protobuf, "notification messages are encoded in binary form using Google Protobuf" and the FPolicy server must support protobuf deserialization. Phase 12 will focus on confirming the protobuf wire framing with NetApp and adapting the transport reader accordingly.

4. Cross-Account Observability

Deployed resources

Resource	Purpose
OAM Sink	Receives metrics/traces from workload accounts
CloudWatch Dashboard	Lambda Invocations/Errors, Step Functions Executions, Processing Latency
SNS Topic (KMS-encrypted)	Aggregated alerts from all accounts
X-Ray Group	Cross-account trace filtering
IAM MetricDeliveryRole	Workload accounts assume this to push metrics
IAM TroubleshootingRole	Read-only access for cross-account debugging
Log Group	Aggregated log destination

Single-account limitation

OAM Links cannot be created within the same account (AWS design constraint). The deployment was verified as a single-account simulation per the requirements. A workload-account-oam-link.yaml template is provided for multi-account environments.

Template fix: LogDestination

During deployment, AWS::Logs::Destination failed because it requires a Kinesis Data Stream as target, not a Log Group. This clarified that a CloudWatch Logs destination is not a generic alias for another log group; it is a cross-account subscription destination backed by a supported streaming target such as Kinesis Data Streams or Kinesis Data Firehose. The template was fixed to use Log Group + IAM Role directly, with Kinesis Firehose as an optional future addition for high-volume cross-account log aggregation.

5. Persistent Store: Closing the Restart Event-Loss Window

The problem

With is-mandatory=false, ONTAP drops FPolicy notifications when no server is connected, so any events generated during a Fargate task restart (~30 seconds) are lost.

The solution

ONTAP 9.14.1+ Persistent Store queues file access events on the SVM during server disconnection for asynchronous non-mandatory policies. When the external server reconnects, queued events can be replayed. Note that synchronous policies and asynchronous mandatory policies are not supported — Persistent Store is specifically designed for the asynchronous non-mandatory configuration used in this pattern.

Configuration (via Lambda → ONTAP REST API)

Step 1: Create volume (1GB, unix security style)
  POST /api/storage/volumes → 202 Accepted (3s)

Step 2: Create Persistent Store
  POST /api/protocols/fpolicy/{svm}/persistent-stores → 201 Created
  Body: {"name": "fpolicy_aws_store", "volume": "fpolicy_persistent_store"}

Step 3: Attach to policy (disable → attach → re-enable)
  PATCH /api/protocols/fpolicy/{svm}/policies/fpolicy_aws
  Body: {"persistent_store": "fpolicy_aws_store"}
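
The store-creation and attach steps can be expressed as an ordered request plan, which is handy for dry runs and review. This sketch only builds the plan and sends nothing. The volume-creation call is left out because its request body is not shown here, and the enabled=false / enabled=true bodies for the disable and re-enable calls are an assumption based on the policy resource's `enabled` field.

```python
SVM = "{svm}"  # placeholder, kept exactly as written in the steps above


def persistent_store_plan(policy: str, store: str, volume: str) -> list:
    """Return (method, path, body) tuples in the order they should be sent."""
    base = f"/api/protocols/fpolicy/{SVM}"
    policy_path = f"{base}/policies/{policy}"
    return [
        ("POST", f"{base}/persistent-stores", {"name": store, "volume": volume}),
        ("PATCH", policy_path, {"enabled": False}),  # assumed disable body
        ("PATCH", policy_path, {"persistent_store": store}),
        ("PATCH", policy_path, {"enabled": True}),   # assumed re-enable body
    ]


plan = persistent_store_plan("fpolicy_aws", "fpolicy_aws_store", "fpolicy_persistent_store")
for method, path, body in plan:
    print(method, path, body)
```

Keeping the plan as data lets the same sequence be reviewed in a pull request before anything touches the SVM.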

Verification

GET /api/protocols/fpolicy/{svm}/policies/fpolicy_aws?fields=persistent_store,enabled
→ {"enabled": true, "persistent_store": "fpolicy_aws_store"}

ECS task stop → restart test confirmed ONTAP reconnects to the new task within seconds. With Persistent Store configured, events generated during the tested ~30-second Fargate restart window are expected to be queued by ONTAP and replayed after reconnection. Phase 12 will validate this with real NFS/SMB file operations end to end, including verification of replay ordering and completeness under sustained write load.

IP Updater Lambda extension

The IP Updater Lambda was extended with a generic ONTAP API access capability (action: ontap_api). This enables remote ONTAP configuration without a bastion host:

aws lambda invoke --function-name fsxn-fpolicy-ip-updater \
  --payload '{"action": "ontap_api", "method": "GET", "path": "/api/protocols/fpolicy/{svm}/persistent-stores"}' \
  /tmp/result.json

6. HYBRID Mode Idempotency

The problem

In HYBRID mode, both the EventBridge Scheduler (polling) and the FPolicy EventBridge Rule (event-driven) can trigger processing for the same file. Without deduplication, the same file gets processed twice.

The solution

A DynamoDB-based Idempotency Store with TTL:

Table: fsxn-s3ap-idempotency-store
  pk: "{uc_name}#{file_path}"
  sk: "{operation_type}#{timestamp_bucket}"
  ttl: current_time + 7 days

The timestamp_bucket rounds timestamps to 5-minute windows. Two events for the same file within the same 5-minute window are considered duplicates.
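
A minimal sketch of how such keys could be derived, assuming UTC timestamps and a window that divides 60 evenly. `idempotency_key` is a hypothetical helper, not the repository's checker Lambda.

```python
from datetime import datetime, timedelta, timezone


def idempotency_key(uc_name, file_path, operation_type, event_time,
                    window_minutes=5, now=None):
    """Return (pk, sk, ttl_epoch) following the key layout above."""
    now = now or datetime.now(timezone.utc)
    # Round the event time down to the start of its window.
    bucket_minute = event_time.minute - event_time.minute % window_minutes
    bucket = event_time.replace(minute=bucket_minute, second=0, microsecond=0)
    pk = f"{uc_name}#{file_path}"
    sk = f"{operation_type}#{bucket.strftime('%Y-%m-%dT%H:%M')}"
    ttl = int((now + timedelta(days=7)).timestamp())  # DynamoDB TTL is epoch seconds
    return pk, sk, ttl


event_time = datetime(2026, 5, 15, 10, 37, 42, tzinfo=timezone.utc)
print(idempotency_key("legal-compliance", "/legal/audit/report.pdf", "create", event_time))
```

Note that fixed windows have edges: two events a few seconds apart can land in different buckets if they straddle a window boundary.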

Step Functions integration

The Idempotency Checker runs as the first step in any UC's Step Functions workflow (excerpt; the UC-specific ProcessEvent state and the terminal SkipDuplicate state are omitted):

{
  "StartAt": "IdempotencyCheck",
  "States": {
    "IdempotencyCheck": {
      "Type": "Task",
      "Resource": "${IdempotencyCheckerFunction.Arn}",
      "Next": "CheckDuplicate"
    },
    "CheckDuplicate": {
      "Type": "Choice",
      "Choices": [{
        "Variable": "$.idempotency.is_duplicate",
        "BooleanEquals": true,
        "Next": "SkipDuplicate"
      }],
      "Default": "ProcessEvent"
    }
  }
}

Race conditions are handled via DynamoDB conditional writes (attribute_not_exists(pk)). If two executions race, only one succeeds — the other gets ConditionalCheckFailedException and skips.
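
The effect of the conditional write can be demonstrated without AWS. The class below is an in-memory stand-in that emulates a PutItem guarded by attribute_not_exists(pk); all names are hypothetical, and a real checker would call DynamoDB instead.

```python
import threading


class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""


class InMemoryIdempotencyStore:
    """Toy replacement for the DynamoDB table, for illustration only."""

    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()

    def put_if_absent(self, pk, sk):
        # Emulates PutItem with ConditionExpression attribute_not_exists(pk).
        with self._lock:
            if (pk, sk) in self._items:
                raise ConditionalCheckFailed(f"{pk} / {sk}")
            self._items[(pk, sk)] = True


def check(store, pk, sk):
    """Return the shape the Choice state reads: idempotency.is_duplicate."""
    try:
        store.put_if_absent(pk, sk)
        return {"idempotency": {"is_duplicate": False}}
    except ConditionalCheckFailed:
        return {"idempotency": {"is_duplicate": True}}


store = InMemoryIdempotencyStore()
results = []
key = ("legal-compliance#/legal/audit/report.pdf", "create#2026-05-15T10:35")
threads = [
    threading.Thread(target=lambda: results.append(check(store, *key)))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [r for r in results if not r["idempotency"]["is_duplicate"]]
print(len(winners), "of", len(results), "executions proceed")
```

However the threads interleave, exactly one caller sees is_duplicate as false; the rest take the skip branch.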

Tuning considerations

The 5-minute bucket is intentionally conservative for HYBRID-mode deduplication. UCs that require multiple legitimate updates to the same file within a short interval can tune the bucket size via the DEDUP_WINDOW_MINUTES environment variable, or include an additional event attribute (such as file size or ONTAP event sequence information) in the sort key to distinguish genuinely distinct events from duplicates.

Live E2E verification

Verified the deduplication mechanism directly against the deployed DynamoDB table:

1st PutItem (pk=legal-compliance#/legal/audit/report.pdf, sk=create#2026-05-15T10:35):
  → Success (new record created)

2nd PutItem (same key, condition: attribute_not_exists(pk)):
  → ConditionalCheckFailedException ✅ (duplicate detected)

This proves the table-level duplicate rejection mechanism used by HYBRID mode. When the Idempotency Checker is the first Step Functions task, the second execution can be rejected before downstream processing starts.

7. FR-2 Migration Path

If/when native S3AP notifications become available through the FR-2 track, the migration is designed to be parameter-change-only for UCs that do not depend on FPolicy-only fields:

Phase	TriggerMode	FPolicy	S3AP Notifications
A (parallel)	HYBRID	Active	Active
B (cutover)	EVENT_DRIVEN	Disabled	Active
C (cleanup)	EVENT_DRIVEN	Removed	Active

Schema compatibility challenges

FPolicy field	S3AP equivalent	Gap
`user_name`	N/A	S3AP may not include NTFS user info
`operation_type: rename`	N/A	S3 events don't have rename
`protocol`	Always "s3"	Loss of NFS/SMB distinction

UCs that depend on user_name (permission-aware scenarios) may need to maintain FPolicy even after FR-2 GA.

Full migration path documented in docs/guides/fr2-migration-path.md.

8. Test Results

Category	Count	Result
Existing tests (Phase 1-10)	391	All PASS ✅
protobuf parser tests	18	All PASS ✅
Idempotency checker tests	10	All PASS ✅
FPolicy engine tests	16	All PASS ✅
Skipped (handler refactored)	3	Expected ⏭️
Total	435 + 3 skipped	All PASS

CloudFormation validation

Method	Result
cfn_yaml parse (all 17 UCs)	17/17 PASS
`aws cloudformation validate-template`	17/17 PASS
shared templates (observability, idempotency, OAM link)	4/4 PASS

9. Deployed AWS Resources

Stack	Resources	Status
`fsxn-shared-observability`	OAM Sink, Dashboard, SNS, X-Ray Group, IAM Roles	✅
`fsxn-idempotency-store`	DynamoDB (PAY_PER_REQUEST, TTL, PITR)	✅
`fsxn-fpolicy-routing`	EventBridge Bus, Bridge Lambda, Idempotency Table	✅
`fsxn-fp-srv`	ECS Fargate Cluster, FPolicy Server Service	✅
`fsxn-fpolicy-ingestion`	SQS Queue, DLQ, IP Updater Lambda	✅

ONTAP resources

Resource	Status
FPolicy policy `fpolicy_aws`	Enabled, persistent_store attached
Persistent Store `fpolicy_aws_store`	Active (1GB volume)
Engine format	XML (protobuf tested, reverted due to framing)

Post-deployment health check (2026-05-15)

Component	Status	Detail
FPolicy Server (ECS Fargate)	✅ Running	ONTAP connecting every 10s
SQS Ingestion Queue	✅ Empty (0/0/0)	No stuck messages
FPolicy Policy	✅ Enabled	persistent_store + engine attached
DynamoDB Idempotency	✅ Active	TTL enabled, PITR on
SNS Alerts	⚠️ PendingConfirmation	Email subscription awaiting confirmation
EventBridge Custom Bus	✅ Operational	Dispatch routing verified via put-events

10. Deployment Learnings

Issue	Root Cause	Fix
`validate-template` fails for autonomous-driving	Template exceeds 51,200 byte inline limit	Use S3 URL for validation; added CI job
`AWS::Logs::Destination` creation fails	Requires Kinesis target, not Log Group	Removed LogDestination, use Log Group directly
OAM Link same-account error	AWS design: links only work cross-account	Documented; provided workload-account template
SchedulerRole created in EVENT_DRIVEN mode	Missing Condition on SchedulerRole	Added `Condition: IsPollingOrHybrid` to 14 templates
protobuf messages rejected as invalid length	Different TCP framing in protobuf mode	Documented; XML mode maintained for stability
`test_fpolicy_engine` import errors	Handler refactored to IP Updater	Added missing exports; skipped 3 legacy tests
Persistent Store `autoflush_enabled` rejected	Parameter name not supported in REST API	Removed; ONTAP uses defaults
Policy modification while enabled	ONTAP rejects PATCH on enabled policy	Disable → modify → re-enable sequence
`.pdf` suffix causes multi-UC fan-out	EventBridge OR evaluation within array	Document: use prefix as primary filter
EventBridge → CloudWatch Logs delivery fails	Missing resource policy on log group	Added `logs:PutLogEvents` permission for events.amazonaws.com

11. Production Adoption Guidance

Recommended rollout model

TriggerMode is not just a template parameter — it is the operational rollout and rollback control for event-driven adoption. A detailed guide with rollback criteria, UC classification, and CloudFormation behavior matrix is available in docs/guides/triggermode-rollout.md. The summary:

Start with POLLING for all UCs to preserve existing behavior.
Enable the shared FPolicy ingestion pipeline and validate EventBridge routing with put-events.
Move one low-risk UC to HYBRID and observe duplicate rate, Step Functions success rate, and SQS backlog.
Move latency-sensitive UCs to EVENT_DRIVEN after routing and idempotency validation.
Keep compliance-sensitive UCs in HYBRID until Persistent Store replay is validated end to end.

Rollback: At any stage, reverting TriggerMode to the previous value via CloudFormation stack update restores the CloudFormation-managed resources for the prior mode. Operators should wait for stack update completion and verify scheduler/rule state, SQS backlog, and Step Functions executions before declaring rollback complete. The sequence is always EVENT_DRIVEN → HYBRID → POLLING (never skip HYBRID when rolling back from EVENT_DRIVEN in production).
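
One way to enforce this sequencing in a deployment pipeline is a small pre-flight check. The sketch assumes single-step moves in both directions; the rule is stated explicitly only for rollback, so the forward restriction is an assumption, and `is_allowed_transition` is a hypothetical helper.

```python
ORDER = ["POLLING", "HYBRID", "EVENT_DRIVEN"]


def is_allowed_transition(current: str, target: str) -> bool:
    """Allow only single-step moves along POLLING <-> HYBRID <-> EVENT_DRIVEN."""
    return abs(ORDER.index(current) - ORDER.index(target)) == 1


for current, target in [
    ("EVENT_DRIVEN", "HYBRID"),
    ("HYBRID", "POLLING"),
    ("EVENT_DRIVEN", "POLLING"),  # skips HYBRID, so it is rejected
]:
    print(current, "->", target, is_allowed_transition(current, target))
```

A check like this can run before the stack update so that a skipped step fails fast instead of surfacing as missed events.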

Security guardrails for ONTAP API automation

The ontap_api action is intended for controlled operations automation, not as an unrestricted ONTAP proxy. The handler implementation (shared/lambdas/fpolicy_engine/handler.py) enforces:

Path allowlist: Only /api/protocols/fpolicy/, /api/storage/volumes, /api/storage/aggregates, and /api/cluster/jobs/ are permitted. All other paths return HTTP 403.
DELETE method restriction: Disabled by default. Requires explicit ONTAP_API_ALLOW_DELETE=true environment variable to enable.
Log redaction: Only method and path are logged — request bodies containing credentials are never written to CloudWatch Logs.
Structured audit log: Each invocation emits a structured log line with method, path, status, correlation_id, and request timestamp. Caller identity can be correlated via CloudTrail Lambda Invoke events without logging sensitive request/response bodies.
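
The first two guardrails could look roughly like the following. This is an illustrative sketch, not the code in handler.py: `authorize` is a hypothetical function, and returning 403 for a blocked DELETE is an assumption (403 is specified above only for disallowed paths).

```python
import os

ALLOWED_PREFIXES = (
    "/api/protocols/fpolicy/",
    "/api/storage/volumes",
    "/api/storage/aggregates",
    "/api/cluster/jobs/",
)


def authorize(method: str, path: str, env=os.environ) -> int:
    """Return 200 if the request may proceed, else 403."""
    if not path.startswith(ALLOWED_PREFIXES):
        return 403
    delete_enabled = env.get("ONTAP_API_ALLOW_DELETE", "").lower() == "true"
    if method.upper() == "DELETE" and not delete_enabled:
        return 403  # assumed status code for a blocked DELETE
    return 200


print(authorize("GET", "/api/protocols/fpolicy/{svm}/persistent-stores", env={}))
print(authorize("GET", "/api/security/accounts", env={}))
print(authorize("DELETE", "/api/storage/volumes/example", env={}))
```

A production handler would likely also normalize the path before the prefix check so that sequences such as `..` cannot escape the allowlist.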

Production deployments should additionally restrict Lambda invoke permissions to deployment automation roles only, and store ONTAP credentials in Secrets Manager with rotation planning.

Pass correlation_id in the event payload to trace ONTAP API operations across deployment automation, Lambda logs, and operational runbooks.

MSP and multi-customer naming

For MSP or multi-customer deployments, parameterize shared resource names with CustomerId, EnvironmentName, and Region to avoid cross-tenant naming collisions — for example: {customer}-{env}-fsxn-fpolicy-events and {customer}-{env}-s3ap-idempotency-store. Full naming guidance with CloudFormation examples is in docs/guides/triggermode-rollout.md.

TriggerMode governance

For enterprise rollout, treat TriggerMode as a governed operational control. Changes from POLLING to HYBRID or EVENT_DRIVEN should be reviewed with routing test results, idempotency validation, alarm readiness, and rollback owner assignment. Track TriggerMode changes through your change management process (Change Manager, GitOps PR, or deployment pipeline logs) — not just CloudFormation stack events.

Event payload sensitivity

For public-sector or regulated workloads, file paths and FPolicy metadata should be treated as potentially sensitive data. In regulated environments, metadata is data — file paths, user names, and protocol context should be classified before being forwarded to cross-account observability systems. Production deployments should define which event fields are logged, masked, hashed, or excluded before forwarding to cross-account observability or long-term audit storage. A data classification guide is available in docs/guides/data-classification.md.
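
A field-level policy of this kind might be sketched as follows. The policy table, the salt handling, and the `redact_event` name are all hypothetical; which fields to keep, hash, or drop is a classification decision for each deployment.

```python
import hashlib

# Hypothetical per-field policy: "keep", "hash", or "drop".
FIELD_POLICY = {
    "operation_type": "keep",
    "protocol": "keep",
    "file_path": "hash",
    "user_name": "drop",
}


def redact_event(detail: dict, salt: bytes) -> dict:
    """Apply the field policy before an event leaves the workload account."""
    out = {}
    for field, value in detail.items():
        action = FIELD_POLICY.get(field, "drop")  # unknown fields are dropped
        if action == "keep":
            out[field] = value
        elif action == "hash":
            digest = hashlib.sha256(salt + str(value).encode("utf-8")).hexdigest()
            out[field] = f"sha256:{digest}"
    return out


event = {
    "file_path": "/legal/audit/report.pdf",
    "operation_type": "create",
    "protocol": "nfs",
    "user_name": "example-user",
}
print(redact_event(event, salt=b"example-salt"))
```

Hashing with a per-environment salt keeps paths correlatable across log lines without exposing the path itself; defaulting unknown fields to "drop" means new event attributes stay private until someone classifies them.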

For regulated workloads, duplicate suppression should not mean audit disappearance; skipped duplicate events should still be recorded with correlation IDs and deduplication decisions. See docs/guides/compliance-audit-ledger.md for the audit ledger design.

File readiness for event-driven pipelines

For large files, an FPolicy create or write event may arrive before the file write is complete — particularly with NFSv3 which lacks close semantics. UCs that process large analytics, imaging, geospatial, or EDA files should combine event-driven triggering with a readiness strategy:

Rename-based commit: Write to a temporary path, rename to final path on completion. Process only rename events.
Marker file: Write a .done or _SUCCESS marker after the primary file is complete. Trigger on marker creation.
Size-stability check: Poll file size at N-second intervals; start processing when size is stable across two consecutive checks.

The existing WRITE_COMPLETE_DELAY_SEC (default 5s) in the FPolicy server provides a basic delay, but is insufficient for multi-GB files. A fixed delay should be treated as a fallback, not a correctness guarantee. The new UC checklist (docs/guides/new-uc-checklist.md) includes file readiness as a required design decision for large-file UCs.
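
The size-stability strategy could be sketched like this, assuming the file is reachable through a local or NFS-mounted path. `wait_until_stable` is a hypothetical helper.

```python
import os
import time


def wait_until_stable(path, interval_sec=5.0, max_checks=60):
    """Return the file size once two consecutive checks agree, else None."""
    previous = -1
    for _ in range(max_checks):
        size = os.path.getsize(path)
        if size == previous:
            return size
        previous = size
        time.sleep(interval_sec)
    return None  # still changing after max_checks; caller should retry or alert
```

A stable size is a heuristic, not proof of completion: a writer that pauses longer than the interval will look finished, so rename-based commit or marker files remain the stronger options where the producer can be changed.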

Recommended operational alarms

A ready-to-deploy CloudFormation template (shared/cfn/recommended-alarms.yaml) defines the following alarms. Severity labels are examples and should be mapped to each organization's incident classification model.

Metric	Condition	Severity
SQS `ApproximateAgeOfOldestMessage`	> 300 seconds for 5 minutes	SEV2
SQS DLQ `ApproximateNumberOfMessagesVisible`	> 0	SEV2
Step Functions `ExecutionsFailed`	> 0 for critical production UCs	SEV2
ECS `RunningTaskCount` < `DesiredTaskCount`	for > 60 seconds	SEV1
DynamoDB `ThrottledRequests`	> 0	SEV3

The ECS desired-vs-running alarm may require Container Insights, metric math, or a custom service health metric depending on how ECS service metrics are emitted in the target account. For high-volume batch UCs, failure-rate-based alarms may be less noisy than absolute failure-count alarms.

Deploy as a standalone monitoring stack or integrate into each UC template's EnableCloudWatchAlarms section.

Initial SLO candidates

While formal SLO definition is a Phase 12 deliverable, the following targets serve as initial operational guidance:

99% of events delivered to SQS within 60 seconds under normal load
FPolicy server reconnect within 60 seconds after ECS task replacement
SQS backlog recovered within 5 minutes after planned maintenance
Step Functions start latency under 2 minutes for EVENT_DRIVEN UCs

Persistent Store sizing

For environments requiring Persistent Store, size the volume based on expected outage duration:

required_size = event_rate_per_sec × max_outage_duration_sec × avg_event_size_bytes × safety_factor

Example: 100 events/sec × 300 s outage × 500 bytes = 15 MB of raw event data, or about 30 MB with a 2.0 safety factor. The 1 GB volume configured in Phase 11 has room for roughly 2 million 500-byte event records before any safety margin; with the same 2.0 safety factor, treat the practical planning capacity as closer to 1 million events. High-frequency environments (1000+ events/sec) should increase the volume size proportionally and validate replay rate after reconnection.
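
The formula translates directly into code. This sketch assumes decimal units (1 GB = 10^9 bytes) and ignores any ONTAP-side metadata overhead, which is not quantified here; the function names are illustrative.

```python
def required_store_bytes(event_rate_per_sec, max_outage_sec, avg_event_bytes,
                         safety_factor=2.0):
    """Volume space needed to hold events queued during one outage."""
    return event_rate_per_sec * max_outage_sec * avg_event_bytes * safety_factor


def events_that_fit(volume_bytes, avg_event_bytes, safety_factor=2.0):
    """Planning capacity of a volume, in events, after the safety factor."""
    return int(volume_bytes / (avg_event_bytes * safety_factor))


needed = required_store_bytes(100, 300, 500)
capacity = events_that_fit(1_000_000_000, 500)
print(f"needed: {needed / 1e6:.0f} MB, 1 GB planning capacity: {capacity:,} events")
```

Plugging in a site's own peak event rate and longest tolerated outage gives a first-pass volume size to validate against a replay test.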

Full sizing table with scenario-based estimates is available in docs/event-driven/fpolicy-persistent-store.md.

12. Next Phase Outlook

Phase 11 completes the event-driven integration layer. Remaining work for Phase 12:

Protocol and replay validation

protobuf TCP framing: Consult NetApp support on protobuf wire format; adapt read_fpolicy_message() for frameless protobuf
Persistent Store replay E2E validation: NFS/SMB file creation during Fargate restart → verify that queued events are replayed and delivered to SQS without loss
Replay storm testing: Generate events during FPolicy server downtime, reconnect, measure replay duration, SQS ingestion rate, Step Functions concurrency, and whether downstream throttling occurs

Scale and operations

High-load testing: 1000+ events/sec stress test with Fargate scaling
SLO definition: Define event ingestion latency, processing success rate, reconnect time, and replay completion time targets
Multi-account OAM Link: Deploy workload-account-oam-link.yaml in a second account

Production rollout

Production UC deployment: Deploy a UC template with TriggerMode=EVENT_DRIVEN and verify end-to-end file operation → Step Functions execution

Already verified in Phase 11 (no longer Phase 12 candidates):

✅ EventBridge dispatch routing (put-events → rule matching → CloudWatch Logs)
✅ Idempotency Store deduplication (conditional write rejection)
✅ Persistent Store configuration (ONTAP REST API)
✅ ECS task restart + ONTAP reconnection

Who should care about Phase 11?

Platform teams can now switch any UC between polling and event-driven with a single parameter change — no template surgery required
Operations / SRE teams get Cross-Account Observability with a pre-built dashboard, recommended alarm thresholds, and a rollout/rollback model
Compliance teams get Persistent Store support to close the tested Fargate restart event-loss window, with full replay validation planned for Phase 12
Security teams get documented guardrails for the ONTAP API automation path, including allowlist, audit recommendations, and event payload sensitivity guidance
Architecture teams get a documented FR-2 migration path — if/when native S3AP notifications become available, the transition is a parameter change for compatible UCs
Data engineering teams get file-readiness guidance for large-file analytics pipelines where event arrival precedes write completion
MSPs and partners get cross-account templates, tenant-aware naming guidance, and a standardized TriggerMode control for multi-customer deployments
Performance engineers get protobuf evaluation data (34.6% size reduction) and a clear path to enabling it once TCP framing is resolved
DevOps teams get CI-integrated template validation (cfn_yaml + validate-template) catching issues before deployment

Conclusion

Phase 11 transforms the FPolicy event-driven pipeline from a single-UC reference implementation into a production-ready, 17-UC integrated system. TriggerMode doubles as the rollout and rollback control, letting platform teams move individual UCs through POLLING → HYBRID → EVENT_DRIVEN at their own pace.

UC-specific EventBridge rules handle routing complexity through path-prefix ownership boundaries, while the Idempotency Store prevents duplicate processing in HYBRID mode. Persistent Store closes the known Fargate restart event-loss window at the ONTAP configuration layer, while Phase 12 will validate replay completeness with real NFS/SMB file operations.

The protobuf evaluation yielded a valuable real-world finding: ONTAP uses different TCP framing for protobuf messages than for XML. The field-level parser is validated against test fixtures, but the transport layer needs adaptation — a focused Phase 12 task requiring NetApp consultation rather than a blocker.

With 435 passing tests, 17 validated templates, 5 deployed CloudFormation stacks, production adoption guidance (rollout model, governance, security guardrails, event payload sensitivity, file readiness, alarm thresholds, Persistent Store sizing), and comprehensive documentation, Phase 11 delivers the operational maturity needed for enterprise-grade event-driven file workflows on FSx for ONTAP.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
Previous phases: Phase 1 · Phase 7 · Phase 8 · Phase 9 · Phase 10
