This document is a proposal for a new UserPool class in Crawlee, which aims to simplify resource management by unifying the handling of browser and session resources. The actual implementation of this RFC is available in the feat/user-pool branch of the Crawlee repository.
Resource management is currently spread across multiple classes (BrowserPool, SessionPool), which adds complexity and can lead to resource conflicts. This RFC proposes unified UserPool and ResourceOwner classes to streamline resource management and improve efficiency.
For more details on the motivation, see the UserPool design and thoughts document.
ResourceOwner class

The core of this proposal is the ResourceOwner class, which is responsible for owning arbitrary resources. It can be referenced by id.
export abstract class ResourceOwner<ResourceType> {
    // Arbitrary ID to identify the resource owner.
    // Multiple user instances can share the same ID.
    public id: string = '';

    // Runs the passed task with the managed resource as an argument.
    public abstract runTask<T, TaskReturnType extends Promise<T> | T>(
        task: (resource: ResourceType) => TaskReturnType,
    ): Promise<T>;

    // `ResourceOwner` subclasses can decide to lock the resource while it's being used.
    // This is useful for resources that are stateful (e.g. browser pages).
    public abstract isIdle(): boolean;
}
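
To make the locking behaviour mentioned in the comments concrete, here is a minimal sketch of a subclass that allows only one task at a time. The ExclusiveOwner name, its constructor, and the choice to reject overlapping tasks (rather than queue them) are assumptions made for illustration; none of them come from the RFC or the feat/user-pool branch.

```typescript
// Reproduced from the proposal so the sketch is self-contained.
abstract class ResourceOwner<ResourceType> {
    public id: string = '';

    public abstract runTask<T, TaskReturnType extends Promise<T> | T>(
        task: (resource: ResourceType) => TaskReturnType,
    ): Promise<T>;

    public abstract isIdle(): boolean;
}

// Hypothetical subclass (not part of the RFC): runs one task at a time.
class ExclusiveOwner<ResourceType> extends ResourceOwner<ResourceType> {
    private readonly resource: ResourceType;
    private busy = false;

    constructor(id: string, resource: ResourceType) {
        super();
        this.id = id;
        this.resource = resource;
    }

    public async runTask<T, TaskReturnType extends Promise<T> | T>(
        task: (resource: ResourceType) => TaskReturnType,
    ): Promise<T> {
        // Assumption: reject overlapping tasks instead of queueing them.
        if (this.busy) {
            throw new Error(`ResourceOwner "${this.id}" is busy`);
        }
        this.busy = true;
        try {
            // Awaiting handles both sync and async tasks. The cast via `unknown`
            // is there because the compiler may not be able to prove that the
            // awaited value has type `T`.
            const result: unknown = await task(this.resource);
            return result as T;
        } finally {
            // Release the lock even if the task throws.
            this.busy = false;
        }
    }

    public isIdle(): boolean {
        return !this.busy;
    }
}
```

Because the flag is set before the first await, isIdle() already returns false when runTask() hands back its promise, so a caller that checks idleness right after scheduling a task sees the owner as locked.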
UserPool class

The UserPool class is a container for ResourceOwner instances. It allows users to manage ResourceOwner instances and provides methods to acquire and release resources.
export class UserPool<ResourceType> {
    constructor(private users: ResourceOwner<ResourceType>[] = []) {}

    public getUser(filter?: { id?: string }): ResourceOwner<ResourceType> | undefined;

    public hasIdleUsers(filter?: { id?: string }): boolean;
}
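
The signatures above leave the selection logic open. The following is one possible sketch of the two methods, not the implementation from the feat/user-pool branch. It assumes that getUser returns the first idle owner matching the filter and that an omitted id matches every owner; both rules are illustrative guesses.

```typescript
// Reproduced from the proposal so the sketch is self-contained.
abstract class ResourceOwner<ResourceType> {
    public id: string = '';

    public abstract runTask<T, TaskReturnType extends Promise<T> | T>(
        task: (resource: ResourceType) => TaskReturnType,
    ): Promise<T>;

    public abstract isIdle(): boolean;
}

class UserPool<ResourceType> {
    private users: ResourceOwner<ResourceType>[];

    constructor(users: ResourceOwner<ResourceType>[] = []) {
        this.users = users;
    }

    // Assumption: return the first owner that matches the filter and is idle.
    public getUser(filter?: { id?: string }): ResourceOwner<ResourceType> | undefined {
        const id = filter?.id;
        return this.users.find(
            (user) => (id === undefined || user.id === id) && user.isIdle(),
        );
    }

    // True when at least one matching owner could accept a task right now.
    public hasIdleUsers(filter?: { id?: string }): boolean {
        return this.getUser(filter) !== undefined;
    }
}
```

Since several owners may share the same id, filtering by id selects a group of owners rather than a single instance, and the idleness check then picks an available member of that group.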
The UserPool class can be used to manage resources in a unified way. Here’s a simplified example of how to use it:
class BrowserUser extends ResourceOwner<Browser> {
    ...
    public async runTask(task) {
        this._isIdle = false;
        try {
            return await task(this.browser);
        } finally {
            // Reset the flag even if the task throws.
            this._isIdle = true;
        }
    }

    public isIdle(): boolean {
        return true; // Multiple tasks can use the browser concurrently.
    }
    ...
}
class BasicCrawler<ResourceType> {
    ...
    userPool: UserPool<ResourceType>;
    ...
}
class BrowserCrawler extends BasicCrawler<Browser> {
    ...
    async requestHandler(context: RequestHandlerContext): Promise<void> {
        const user = this.userPool.getUser();
        if (user) {
            // Use the resource
            await user.runTask(async (browser) => {
                const page = await browser.newPage();
                try {
                    await page.goto(context.request.url);
                    await this.userRequestHandler({
                        ...context,
                        browser,
                    });
                } finally {
                    // Close the page even if navigation or the handler throws.
                    await page.close();
                }
            });
        }
    }
    ...
}
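
The BrowserUser example lets several tasks share one browser, so a single boolean flag cannot describe how loaded the owner is. As a sketch of one alternative, the owner below counts active tasks and reports itself idle while it is under a concurrency limit. The SharedOwner name and the maxConcurrency parameter are hypothetical and do not appear in the RFC.

```typescript
// Reproduced from the proposal so the sketch is self-contained.
abstract class ResourceOwner<ResourceType> {
    public id: string = '';

    public abstract runTask<T, TaskReturnType extends Promise<T> | T>(
        task: (resource: ResourceType) => TaskReturnType,
    ): Promise<T>;

    public abstract isIdle(): boolean;
}

// Hypothetical owner that lets up to `maxConcurrency` tasks share one resource.
class SharedOwner<ResourceType> extends ResourceOwner<ResourceType> {
    private readonly resource: ResourceType;
    private readonly maxConcurrency: number;
    private activeTasks = 0;

    constructor(id: string, resource: ResourceType, maxConcurrency: number) {
        super();
        this.id = id;
        this.resource = resource;
        this.maxConcurrency = maxConcurrency;
    }

    public async runTask<T, TaskReturnType extends Promise<T> | T>(
        task: (resource: ResourceType) => TaskReturnType,
    ): Promise<T> {
        this.activeTasks += 1;
        try {
            // Cast via `unknown` because the compiler may not be able to prove
            // that the awaited value has type `T`.
            const result: unknown = await task(this.resource);
            return result as T;
        } finally {
            // Decrement even if the task throws, so the count cannot leak.
            this.activeTasks -= 1;
        }
    }

    // "Idle" here means "has spare capacity", not "has no running tasks".
    public isIdle(): boolean {
        return this.activeTasks < this.maxConcurrency;
    }
}
```

This sketch does not enforce the limit inside runTask(); it relies on the caller checking isIdle() (for example through the pool) before scheduling work.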