Searching the cloud becomes a challenge when content is spread across millions of servers or buried inside SaaS applications. How can organizations improve their knowledge management and boost data retrieval?
- Financial services firms are experimenting with the public and private cloud, leading to new opportunities and challenges in relation to both.
- Content can be spread across multiple vendors’ millions of servers or buried inside SaaS applications, making knowledge management in the cloud essential.
- When considering a new enterprise solution for searching the cloud, there are 14 questions that organizations should be asking.
Companies around the world are increasingly reducing the cost of internal IT infrastructure by giving up their data centers in favor of cloud computing — in other words, embracing external servers for storage and computation.
Amazon, Google, IBM and Microsoft are some examples of big cloud vendors that rent out their hardware assets (servers, disks, networks).
At the application level, there is also a trend away from locally installed software on desktop PCs towards cloud-hosted web applications (Software as a Service, or SaaS).
This offers savings by eliminating the life-cycle of installing, maintaining, upgrading and sun-setting applications. SAP, Salesforce, Oracle, ServiceNow and Workday are key SaaS vendors.
Public cloud benefits
Both lower-level storage and computing resources and higher-level application resources benefit from cloud migrations, and not just in terms of cost.
Security can be improved by a cloud migration because cloud vendors, thanks to their scale, can afford to employ more top-grade security staff.
Elasticity is another huge advantage of the cloud.
The number of servers or application users can be rapidly increased or decreased depending on demand, since cloud vendor infrastructure is always available in abundance — it is shared with all their customers.
The cloud abstraction assumes that an infinite supply of resources is ‘out there’, which can be rented and released at will, either manually by admins or automatically via APIs.
Scale and security advantages
Before the advent of the cloud, many companies already had heterogeneous networked hardware and software environments in place, and often struggled with data management (i.e. organizing their data assets) and knowledge management (organizing their knowledge).
The transition from self-managed, on-premise servers and company-owned data centers happened gradually, and many organizations went through a phase of relying on internal private clouds — sharing resources across departments, but only internally.
Private clouds have the advantage that all assets remain confined within the organization’s walls, legally and physically, which simplifies governance and security.
However, private clouds combine the disadvantages of both worlds: they provide neither the elasticity of public clouds nor the economies of scale that large cloud providers enjoy.
Vendors like Amazon or Google can hire the most expensive security experts, as this cost is distributed across millions of servers, not just hundreds.
Knowledge management in the cloud
Private and public clouds add complexity to these pre-existing challenges in at least three ways.
First, network access introduces latency. Second, in a cloud scenario, data access crosses organizational boundaries, which has security and management implications.
Third, the cloud’s elasticity means that rapid commissioning and decommissioning of cloud storage must be dealt with in terms of incremental index updates to keep search results relevant.
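To make the third point concrete, here is a minimal sketch of an incremental index update in Python. It assumes a toy in-memory inverted index; the class name `IncrementalIndex` and the bucket-style document ids are hypothetical and not taken from any particular product. The idea is only that adding or removing one document touches that document's postings rather than rebuilding the whole index.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that supports adding and removing documents in place."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of document ids
        self.doc_terms = {}                # document id -> set of terms it contains

    def add(self, doc_id, text):
        # Re-adding an existing id replaces its old content.
        self.remove(doc_id)
        terms = set(text.lower().split())
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        # Drop only this document's postings; the rest of the index is untouched.
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, query):
        # Return the ids of documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = self.postings.get(term, set())
            result = set(docs) if result is None else result & docs
        return result


index = IncrementalIndex()
index.add("bucket-a/report.txt", "quarterly risk report")
index.add("bucket-b/memo.txt", "risk memo for the board")
index.remove("bucket-a/report.txt")  # e.g. the storage bucket was decommissioned
```

After the removal, a search for "risk" returns only the memo, so results stay relevant without a full re-index. A production system would also need persistence, ranking and distribution, none of which is shown here.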
Findability of an organization’s information in the cloud is therefore crucial, and it is a large part of effective knowledge management.
Time spent searching the cloud
Anecdotally, about 20 percent of enterprise employees’ time is spent searching for information in places such as their emails, internal reports and documents. Information needs should not be met less effectively when searching the cloud than they were before.
Yet clearly, it is challenging to keep all documents findable when they are spread across potentially multiple cloud vendors’ millions of servers or buried inside SaaS applications. Such applications may not expose their data to the indexing servers, which act as the librarians that keep the catalog of keywords up to date.
Companies typically employ web-based content management systems (CMS), and these have their own search functions, but a lot of the information may no longer reside in the CMS, which is often centralized.
So, a new kind of enterprise search may be required to address this.
While a detailed discussion or a vendor comparison is beyond the space available here, the list below contains questions that most organizations implementing a new enterprise product for searching the cloud should try to answer if they want to ensure findability.
Searching the cloud — some implementation questions:
- What document types are supported by the index crawlers?
- Are indexing and retrieval processes federated?
- What kinds of database and system connectors are supported by the index crawlers (Oracle RDBMS, SAP R/3 ERP, PostgreSQL, MongoDB, AWS S3, Microsoft Exchange)?
- Is the enterprise search architecture aligned with my organization’s structure?
- What is the average response time for a set of typical queries under typical system load?
- What is the maximum index size (# unique words, # unique documents)?
- What is the cost of implementing a particular enterprise search application? How is the cost structured (e.g. by user or by CPU)?
- What additional network traffic will implementing a particular solution create on the corporate network?
- How is the investment into a solution protected? For example, is there a clause that the source code of the system will be provided if the vendor decides to sun-set the product?
- What is the security model (permissions for documents, users, groups)? Does the system support search over encrypted content (e.g. homomorphic encryption)?
- How are internal and external cloud resources communicated to the index crawler? Whose responsibility is it to trigger life-cycle state updates, and what API can be used?
- What are meaningful and safe default access privileges for indexed cloud data so they can be found using universal enterprise search queries?
- What worst-case time guarantees are given in terms of time from storing a new file on an external cloud storage node to that file’s content being available in a search?
- What policies are available to control index freshness depending on known data volatility (i.e. frequency of change)?
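As an illustration of the last two questions, the following Python sketch shows one way a freshness policy could be expressed. Everything in it is an assumption made for illustration: the three volatility classes, the recrawl intervals, the `Source` record and the example source names are hypothetical, and the fixed date is arbitrary. In this simplified model, the recrawl interval for a class approximates the worst-case delay before a change becomes searchable, ignoring the duration of the crawl itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical policy table: volatility class -> maximum time between crawls.
RECRAWL_INTERVAL = {
    "high": timedelta(minutes=15),
    "medium": timedelta(hours=6),
    "low": timedelta(days=7),
}


@dataclass
class Source:
    name: str
    volatility: str          # "high", "medium" or "low"
    last_crawled: datetime


def due_for_recrawl(sources, now):
    """Return the names of sources whose index entries may be staler than the policy allows."""
    return [
        s.name
        for s in sources
        if now - s.last_crawled >= RECRAWL_INTERVAL[s.volatility]
    ]


now = datetime(2024, 1, 1, 12, 0)  # arbitrary reference time for the example
sources = [
    Source("shared-drive", "high", now - timedelta(minutes=20)),
    Source("hr-policies", "low", now - timedelta(days=2)),
    Source("crm-export", "medium", now - timedelta(hours=7)),
]
due = due_for_recrawl(sources, now)
```

Here `due` contains the shared drive and the CRM export, while the rarely changing HR policies can wait. Polling on a schedule is only one option; a vendor may instead offer change notifications that push updates to the crawler, which is worth asking about.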
One of the main risks of cloud use is the improper management of access permissions.
Since public cloud resources such as Amazon S3 storage buckets reside outside the organization’s firewall, accidentally granting general read access means world-wide read access. Software bugs can therefore have disastrous consequences, from leaking trade secrets to violating laws and regulations by disclosing specially protected personal information.
Meanwhile, an overly defensive approach leads to cloud silos that cannot be accessed by the organization’s cloud search.
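Both failure modes can be checked for mechanically. The Python sketch below assumes a deliberately simplified, hypothetical permission model in which each resource has a list of grants with a `principal` and a `permission`; it does not use any real cloud vendor's API or policy format, and the names `PUBLIC_PRINCIPALS` and `search-crawler` are made up for the example.

```python
# Illustrative markers for "anyone on the internet" in this toy model.
PUBLIC_PRINCIPALS = {"*", "AllUsers"}


def can_read(grants, principals):
    """True if any grant gives read permission to one of the given principals."""
    return any(
        g["principal"] in principals and g["permission"] == "read"
        for g in grants
    )


def world_readable(resources):
    """Resources that anyone can read: the data-leak risk."""
    return sorted(
        name for name, grants in resources.items()
        if can_read(grants, PUBLIC_PRINCIPALS)
    )


def unsearchable(resources, crawler="search-crawler"):
    """Resources the search crawler cannot read: the cloud-silo risk."""
    return sorted(
        name for name, grants in resources.items()
        if not can_read(grants, {crawler})
    )


resources = {
    "marketing-assets": [
        {"principal": "*", "permission": "read"},
        {"principal": "search-crawler", "permission": "read"},
    ],
    "finance-archive": [
        {"principal": "finance-team", "permission": "read"},
    ],
    "wiki-export": [
        {"principal": "search-crawler", "permission": "read"},
    ],
}

leaks = world_readable(resources)
silos = unsearchable(resources)
```

In this example `leaks` flags the marketing assets and `silos` flags the finance archive. Granting the crawler read access does not by itself make content visible to every employee; the search layer still has to filter results by the permissions of the person searching.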
More cloud research needed
The published literature on cloud indexing and retrieval is still nascent.
This means there is an opportunity for the Information Retrieval (IR) research community to come together to conduct joint benchmark studies across research groups to find out what works best.
This article opens up the discussion on the question of what the migration to the cloud means for search in an organization.
More research in distributed computing will be needed, in particular, regarding federated indexing, retrieval, caching and replication.
A challenge will be to strike a balance between keeping indices complete across local and cloud machines, respecting permissions, and keeping the user’s search experience fast and effective.
This blog has been based on Jochen’s report for The Chartered Institute for IT in which he was invited to write a piece based on his expertise in text analytics.